GenomeComb

Process_project

Format

cg process_project ?options? projectdir ?dbdir?

Summary

process a sequencing project directory (projectdir), generating full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data from various sources.

Description

The cg process_project command performs the entire secondary analysis (clipping, alignment, variant calling, reports, ...) and part of the tertiary analysis (combining samples, annotation, ...) on a number of samples that may come from various sources. You can specify which analyses should be run using the options. The default settings/options are for analysing Illumina genomic sequencing data. You can use the -preset option to specify default settings for various types of analyses (short or long-read genome, transcriptome, etc.) A practical example of the workflow can be found in howto_process_project.

The command expects a basic genomecomb project directory (as described extensively in projectdir) containing a number of samples with raw data (fastq, Complete genomics results, ...). Each sample is in a separate subdirectory of a directory named samples in the projectdir. You can add samples manually or using the cg project_addsample command as described in howto_process_project.

Per sample analysis

In the first step, each sampledir is processed using cg process_sample; Samples in one project can come from different sources (Complete genomics, illumina sequencing) and be of different types (shotgun, amplicon). Some options are applied to all samples, e.g. the -amplicons option (for amplicon sequencing analysis) will place (a link to) the given amplicons file in each sampledir. These options should only be used in projects with uniform samples. For mixed samples, these options can be applied specifically by placing files, e.g. an amplicon file (named reg_*_amplicons.tsv) in the appropriate sample directories. More information on specific sample types and options can be found in the description of cg process_sample.

Combined analysis

In the final step process_project will call cg process_multicompar to combine sample results in the subdirectory compar. Different result files may be present depending on the type of analysis:

annot_compar-projectname.tsv
multicompar file containing information for all variants in all samples (and all methods). If a variant is not present in one of the samples, the information at the position of the variant will be completed (is the position sequenced or not, coverage, ...) The file is also annotated with all databases in dbdir (impact on genes, regions of interest, known variant data)
sreg-projectname.tsv
sequenced region multicompar file containing for all regions whether they are sequenced (1) or nor (0) for each sample.
annot_cgsv-projectname.tsv
combined results of Complete Genomics structural variant calling
annot_cgcnv-projectname.tsv
combined results of Complete Genomics CNV calling

Arguments

projectdir
project directory with illumina data for different samples, each sample in a sub directory. The proc will search for fastq files in dir/samplename/fastq/
dbdir
directory containing reference data (genome sequence, annotation, ...). dbdir can also be given in a projectinfo.tsv file in the project directory. process_project called with the dbdir parameter will create the projectinfo.tsv file.

Options

This command can be distributed on a cluster or using multiple with job options (more info with cg help joboptions)

As different types of original data are processed differently, not all options are applicable. Options that are not applicable to the given type of data are ignored.

By default options are set for short read genomic sequencing (genome/exome/targeted). Presets can be used to set a number of options to the defaults for a given analysis. These options can still be changed by specifically giving them a value after the -preset option (options given later overrule previous ones).

-preset preset
sets a number of options to the defaults for the given "preset", must be one of: srs (short read genomic sequencing), rseq (short read rna-seq), ont (ont genomic sequencing), ontr (ont RNA-seq), scywalker (ont 10x single cell rna-seq), scywalker_pacbio (pacbio 10x single cell rna-seq)
-samplesheet samplesheet
create projectdir based on the data in the given sampleheet as described in cg make_project
-dbdir dbdir
dbdir can also be given as an option (instead of second parameter)
-minfastqreads num
fastq based samples with less than num reads in the fastq files are not processed and not added to the final compar.
-clip
clip adaptor sequences prior to alignment using fastq-mcf (default 1)
-paired 1/0 (-p)
sequenced are paired/unpaired
-adapterfile file
Use file for possible adapter sequences
-removeskew num
-k parameter for sequence clipping using fastq-mcf: sKew percentage-less-than causing cycle removal
-aligners aligner (-a)
use the given aligner for mapping to the reference genome (default bwa) Currently supported are: bwa, bowtie2, minimap2_sr, minimap2, minimap2_pb, minimap2_asm20, ngmlr ; for rna-seq: star, star_2p, hisat2, minimap2_splice, minimap2_splicehq
-ali_keepcomments 1/0
set to 1 to transfer sequence comments in the source fastq or ubams to the alignment (default don't keep for fastq, keep for ubams). This option currently only works for minimap2 aligner
-aliformat format
format of the (final) alignment (map) files, this is by default bam, but can be set to cram
-realign value
If value is 0, realignment will not be performed, use 1 (default) for realignment with gatk, or value srma for alignment with srma
-removeduplicates 0/1/picard/biobambam
By default duplicates will be removed (marked actually) using samtools except for amplicon sequencing. With this option you can specifically request or turn of duplicate removal (overruling the default). If you want to use large amounts of memory ;-), you can still use picard for removing duplicates (third option)
-amplicons ampliconfile
This option turns on amplicon sequencing analysis (as described in cg process_sample) using the amplicons defained in ampliconfile for all samples that do not have a sample specific amplicon file yet.
-varcallers varcallers
(space separated) list of variant callers to be used (default "gatkh strelka"). Currently supported are: gatk, gatkh (haplotype caller), strelka, sam, freebayes, bcf, longshot, clair3,
-svcallers svcallers
(space separated) list of structural variant callers to be used (default empty). Currently supported are: manta, lumpy, gridds sniffles, cuteSV, npinv
-methcallers methcallers
(space separated) list of methylation callers to be used (default empty). Currently supported are: nanopolish
-reftranscripts reftranscripts
file with reference transcripts for isoform calling. (default empty -> finds default in refdb) Currently supported are: flair, isoquant, flames
-isocallers isocallers
(space separated) list of isofrom calling (and counting) programs to be used for rna-seq data. Currently supported are: flair, isoquant, flames
-organelles organelles
(space separated) list of chromosomes that are organelles (that are treated differently in some analysis) If not given explicitely, the ones indicated in the file $refdb/extra/reg_*_organelles.tsv (if present) will be used
-counters counters
(space separated) list of counter programs to be used for rna-seq data. Currently supported are: rnaseqc, qorts
-split 1/0
split multiple alternative genotypes over different line
-downsampling_type NONE/ALL_READS/BY_SAMPLE/
sets the downsampling type used by GATK (empty for default).
-reports list
use basic (default) for creating most reports, or all for all reports. If you only want some made, give these as a space separated list. Possible reports are (further explained in cg process_reports): fastqstats fastqc flagstat_reads flagstat_alignments samstats alignedsamstats unalignedsamstats histodepth vars hsmetrics covered histo predictgender
-m maxopenfiles (-maxopenfiles)
The number of files that a program can keep open at the same time is limited. pmulticompar will distribute the subtasks thus, that the number of files open at the same time stays below this number. With this option, the maximum number of open files can be set manually (if the program e.g. does not deduce the proper limit, or you want to affect the distribution).
-samBQ number
only for samtools; minimum base quality for a base to be considered (samtools --min-BQ option)
-distrreg regions
distribute regions for parallel processing. Possible options are 0: no distribution (also empty) 1: default distribution schr or schromosome: each chromosome processed separately chr or chromosome: each chromosome processed separately, except the unsorted, etc. with a _ in the name that will be combined), a number: distribution into regions of this size a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches) a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions a number preceded by an g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes a file name: the regions in the file will be used for distribution
-maxfastqdistr maxfastqdistr
if there are more than maxfastqdistr separate input fastqs, they will be merged into maxfastqdistr fastqs for analysis: If there are many (small) fastqs, the overhead to processes (alignment etc.) them separately (default, to distribute the load) can become too large.
-datatype datatype
Some variant callers (strelka) need to know the type of data (genome, exome or amplicons) for analysis. You can specify it using this option. If not given, it is deduced from acompanying region files (reg_*_amplicons.tsv for ampicons or reg_*_amplicons.tsv for exome)
-hap_bam 0/1
if 1 produce a bam file with haplotype indictions (longshot only) (default 0)
-depth_histo_max number
in reports, count positions with up to number depth (default 1000). Larger dfepths will be counted under number
-targetfile targetfile
if targetfile is provided, coverage statistics will be calculated for this region
-targetvarsfile file
Use this option to easily check certain target positions/variants in the multicompar. The variants in file will allways be added in the final multicompar file, even if none of the samples is variant (or even sequenced) in it.
-dbfile file
Use the given file for extra (files in dbdir are already used) annotation. This option can be given more than once; all given files will be added
-dbfiles files
Use files for extra (files in dbdir are already used) annotation. files should be a space separated list of files.
-conv_nextseq 1/0
generate fastqs for nextseq run & create sample folders - rundir should be placed in projectdir of resulting variants. This option can be added multiple times (with different files)
-jobsample integer
By default (0) the processing of each sample is split in many separate jobs. If you have to process many samples with relatively short individual runtimes or your cluster limits the number of jobs you can set this to 1 or more to run each sample in only one job, thus reducing the job managment overhead. The number given is the number of cores assigned to each such job.
-keepfields fieldlist
Besides the obligatory fields, include only the fields in fieldlist (space separated) in the multicompar file. Default is to use all fields present in the file (*). All fields will still be used in the per sample output.

This command can be distributed on a cluster or using multiple cores with job options (more info with cg help joboptions) The option -distrreg can be used to allow a greater distribution by doing some analyses (variantcalling, annotations) split by region (chromosomes) and combining the results

Sample specific options

Different options can be given to different samples within the same experiment run by storing a file named options.tsv in the experiment/project dir with the following fields: sample option value

For each sample (that differs from the general option if given) you add a line with the samplename, the option (without the -, e.g. sc_expectedcells) and the value (the number of expected cells in the case of sc_expectedcells). Sample specific options given this way overrule the general options given on the process_project commandline (for that sample)

You can also use the preset option this way, allowing the analysis of different technologies (e.g. ont and srs) in one run. Beware that presets just change base/default settings; Options explicitely given on the process_project commandline will overrule settings from a preset, even if sample specific).

Dependencies

Some of the programs used in this workflow are not distributed with genomecomb itself (e.g. gatk, strelka) and should be installed separately. To make this easier, they are available as portable application directories from the [genomecomb website](https://derijkp.github.io/genomecomb/install.html) or can be directly installed using the cg install command.

Example

cg process_project -d sge testproject /complgen/refseq/hg19

Category

Process