Process_project
Format
cg process_project ?options? projectdir ?dbdir?
Summary
process a sequencing project directory (projectdir), generating full analysis
information (variant calls, multicompar, reports, ...) starting from
raw sample data from various sources.
Description
The cg process_project command performs the entire secondary analysis
(clipping, alignment,
variant calling, reports, ...) and part of the tertiary analysis
(combining samples, annotation, ...) on a number of samples that may
come from various sources. You can specify which analyses should be
run using the options. The default settings/options are for analysing
Illumina genomic sequencing data. You can use the -preset option to
specify default settings for various types of analyses (short or
long-read genome, transcriptome, etc.). A practical example of the
workflow can be found in howto_process_project.
The command expects a basic genomecomb project directory (as
described extensively in projectdir)
containing a number of samples with raw data (fastq, Complete
Genomics results, ...). Each sample is in a separate subdirectory of
a directory named samples in the projectdir. You can add
samples manually or using the cg project_addsample command, as
described in howto_process_project.
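The layout described above can be sketched in a few lines of Python. This is only an illustration of adding samples manually, not part of genomecomb: the helper name make_project_skeleton and the sample names are hypothetical, and the fastq subdirectory per sample is assumed from the description of the projectdir argument.

```python
import tempfile
from pathlib import Path

def make_project_skeleton(projectdir, samplenames):
    """Create projectdir/samples/<sample>/fastq for each sample name.

    Hypothetical helper: it only prepares empty directories; the raw
    fastq files still have to be copied or linked into them.
    """
    projectdir = Path(projectdir)
    created = []
    for name in samplenames:
        fastqdir = projectdir / "samples" / name / "fastq"
        fastqdir.mkdir(parents=True, exist_ok=True)
        created.append(fastqdir)
    return created

# Demo in a temporary directory, with made-up sample names
with tempfile.TemporaryDirectory() as tmp:
    for d in make_project_skeleton(Path(tmp) / "testproject", ["sample1", "sample2"]):
        print(d.relative_to(tmp))
```

The cg project_addsample command remains the documented way to add samples; a script like this is mainly useful to see which directories process_project expects to find.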
Per sample analysis
In the first step, each sampledir is processed using cg process_sample. Samples in one
project can come from different sources (Complete Genomics, Illumina
sequencing) and be of different types (shotgun, amplicon). Some
options are applied to all samples, e.g. the -amplicons option (for
amplicon sequencing analysis) will place (a link to) the given
amplicons file in each sampledir. These options should only be used
in projects with uniform samples. For mixed samples, these options
can be applied specifically by placing files, e.g. an amplicon file
(named reg_*_amplicons.tsv), in the appropriate sample directories.
More information on specific sample types and options can be found in
the description of cg process_sample.
Combined analysis
In the final step process_project will call cg process_multicompar to
combine sample results in the subdirectory compar. Different result
files may be present depending on the type of analysis:
- annot_compar-projectname.tsv
- multicompar file containing information for all variants in
all samples (and all methods). If a variant is not present in one
of the samples, the information at the position of the variant will
be completed (is the position sequenced or not, coverage, ...). The
file is also annotated with all databases in dbdir (impact on
genes, regions of interest, known variant data).
- sreg-projectname.tsv
- sequenced region multicompar file containing for all regions
whether they are sequenced (1) or not (0) for each sample.
- annot_cgsv-projectname.tsv
- combined results of Complete Genomics structural variant
calling
- annot_cgcnv-projectname.tsv
- combined results of Complete Genomics CNV calling
Arguments
- projectdir
- project directory with Illumina data for different samples,
each sample in a subdirectory. The command will search for fastq
files in dir/samplename/fastq/
- dbdir
- directory containing reference data (genome sequence,
annotation, ...). dbdir can also be given in a projectinfo.tsv file
in the project directory. process_project called with the dbdir
parameter will create the projectinfo.tsv file.
Options
This command can be distributed on a cluster or using multiple
cores with job options (more info with cg help joboptions).
As different types of original data are processed differently, not
all options are applicable. Options that are not applicable to the
given type of data are ignored.
By default options are set for short read genomic sequencing
(genome/exome/targeted). Presets can be used to set a number of
options to the defaults for a given analysis. These options can still
be changed by specifically giving them a value after the -preset
option (options given later overrule previous ones).
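Because later options overrule earlier ones, the order of the arguments matters: the preset goes first and any explicit changes follow it. A minimal sketch of assembling such a command line is given below. The helper build_command is hypothetical, the chosen override values are only examples taken from the option lists in this section, and nothing is actually executed.

```python
import shlex

def build_command(projectdir, dbdir, preset=None, overrides=None):
    """Return the argument list for a cg process_project call.

    Hypothetical helper: the preset is placed before the explicit
    options so that the explicit options overrule the preset defaults.
    """
    args = ["cg", "process_project"]
    if preset is not None:
        args += ["-preset", preset]
    for option, value in (overrides or {}).items():
        args += ["-" + option, str(value)]
    args += [projectdir, dbdir]
    return args

cmd = build_command(
    "testproject",
    "/complgen/refseq/hg19",
    preset="ont",
    overrides={"varcallers": "clair3", "svcallers": "sniffles cuteSV"},
)
# Print a shell-quoted version for inspection; nothing is run here
print(shlex.join(cmd))
```

Space separated lists such as the value of -svcallers are passed as a single argument, which is why the printed command quotes them.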
- -preset preset
- sets a number of options to the defaults for the given
"preset", must be one of: srs (short read genomic
sequencing), rseq (short read rna-seq), ont (ont genomic
sequencing), ontr (ont RNA-seq), scywalker (ont 10x single cell
rna-seq), scywalker_pacbio (pacbio 10x single cell rna-seq)
- -samplesheet samplesheet
- create projectdir based on the data in the given
samplesheet as described in cg make_project
- -dbdir dbdir
- dbdir can also be given as an option (instead of as the
second parameter)
- -minfastqreads num
- fastq based samples with less than num reads in the
fastq files are not processed and not added to the final compar.
- -clip
- clip adaptor sequences prior to alignment using fastq-mcf
(default 1)
- -paired 1/0 (-p)
- sequences are paired (1) or unpaired (0)
- -adapterfile file
- Use file for possible adapter sequences
- -removeskew num
- -k parameter for sequence clipping using fastq-mcf: sKew
percentage-less-than causing cycle removal
- -aligners aligner (-a)
- use the given aligner for mapping to the reference genome
(default bwa). Currently supported are: bwa, bowtie2, minimap2_sr,
minimap2, minimap2_pb, minimap2_asm20, ngmlr; for rna-seq: star,
star_2p, hisat2, minimap2_splice, minimap2_splicehq
- -ali_keepcomments 1/0
- set to 1 to transfer sequence comments in the source fastq or
ubams to the alignment (default: don't keep for fastq, keep for
ubams). This option currently only works for the minimap2 aligner.
- -aliformat format
- format of the (final) alignment (map) files, this is by
default bam, but can be set to cram
- -realign value
- If value is 0, realignment will not be performed; use
1 (default) for realignment with gatk, or the value srma for
realignment with srma
- -removeduplicates 0/1/picard/biobambam
- By default duplicates will be removed (actually marked) using
samtools, except for amplicon sequencing. With this option you can
specifically request or turn off duplicate removal (overruling the
default). If you want to use large amounts of memory ;-), you can
still use picard for removing duplicates (third option).
- -amplicons ampliconfile
- This option turns on amplicon sequencing analysis (as
described in cg process_sample) using the amplicons defined in
ampliconfile for all samples that do not have a sample
specific amplicon file yet.
- -varcallers varcallers
- (space separated) list of variant callers to be used (default
"gatkh strelka"). Currently supported are: gatk, gatkh
(haplotype caller), strelka, sam, freebayes, bcf, longshot, clair3,
- -svcallers svcallers
- (space separated) list of structural variant callers to be
used (default empty). Currently supported are: manta, lumpy, gridss,
sniffles, cuteSV, npinv
- -methcallers methcallers
- (space separated) list of methylation callers to be used
(default empty). Currently supported are: nanopolish
- -reftranscripts reftranscripts
- file with reference transcripts for isoform calling (default
empty, in which case the default in refdb is used). Currently
supported are: flair, isoquant, flames
- -isocallers isocallers
- (space separated) list of isoform calling (and counting)
programs to be used for rna-seq data. Currently supported are:
flair, isoquant, flames
- -organelles organelles
- (space separated) list of chromosomes that are organelles
(that are treated differently in some analyses). If not given
explicitly, the ones indicated in the file
$refdb/extra/reg_*_organelles.tsv (if present) will be used
- -counters counters
- (space separated) list of counter programs to be used for
rna-seq data. Currently supported are: rnaseqc, qorts
- -split 1/0
- split multiple alternative genotypes over different lines
- -downsampling_type NONE/ALL_READS/BY_SAMPLE/
- sets the downsampling type used by GATK (empty for default).
- -reports list
- use basic (default) for creating most reports, or all for all
reports. If you only want some made, give these as a space
separated list. Possible reports are (further explained in cg process_reports): fastqstats
fastqc flagstat_reads flagstat_alignments samstats alignedsamstats
unalignedsamstats histodepth vars hsmetrics covered histo
predictgender
- -m maxopenfiles (-maxopenfiles)
- The number of files that a program can keep open at the same
time is limited. pmulticompar will distribute the subtasks in such a
way that the number of files open at the same time stays below this
number. With this option, the maximum number of open files can be
set manually (e.g. if the program does not deduce the proper limit,
or you want to affect the distribution).
- -samBQ number
- only for samtools; minimum base quality for a base to be
considered (samtools --min-BQ option)
- -distrreg regions
- distribute regions for parallel processing. Possible options
are:
  - 0: no distribution (also empty)
  - 1: default distribution
  - schr or schromosome: each chromosome processed separately
  - chr or chromosome: each chromosome processed separately, except
    the unsorted, etc. (with a _ in the name), which will be combined
  - a number: distribution into regions of this size
  - a number preceded by an s: distribution into regions targeting
    the given size, but breaks can only occur in unsequenced regions
    of the genome (N stretches)
  - a number preceded by an r: distribution into regions targeting
    the given size, but breaks can only occur in large (>=100000
    bases) repeat regions
  - a number preceded by a g: distribution into regions targeting
    the given size, but breaks can only occur in large (>=200000
    bases) regions without known genes
  - a file name: the regions in the file will be used for
    distribution
- -maxfastqdistr maxfastqdistr
- if there are more than maxfastqdistr separate input
fastqs, they will be merged into maxfastqdistr fastqs for
analysis. If there are many (small) fastqs, the overhead of
processing them separately (alignment etc.; this is the default, to
distribute the load) can become too large.
- -datatype datatype
- Some variant callers (strelka) need to know the type of data
(genome, exome or amplicons) for analysis. You can specify it using
this option. If not given, it is deduced from accompanying region
files (reg_*_amplicons.tsv for amplicons or reg_*_targets.tsv for
exome)
- -hap_bam 0/1
- if 1, produce a bam file with haplotype indications (longshot
only) (default 0)
- -depth_histo_max number
- in reports, count positions with up to number depth
(default 1000). Larger depths will be counted under number
- -targetfile targetfile
- if targetfile is provided, coverage statistics will be
calculated for this region
- -targetvarsfile file
- Use this option to easily check certain target
positions/variants in the multicompar. The variants in file
will always be added in the final multicompar file, even if none
of the samples is variant (or even sequenced) in it.
- -dbfile file
- Use the given file for extra (files in dbdir
are already used) annotation. This option can be given more than
once; all given files will be added
- -dbfiles files
- Use files for extra (files in dbdir are already
used) annotation. files should be a space separated list of
files.
- -conv_nextseq 1/0
- generate fastqs for nextseq run & create sample folders -
rundir should be placed in projectdir of resulting variants. This
option can be added multiple times (with different files)
- -jobsample integer
- By default (0) the processing of each sample is split in many
separate jobs. If you have to process many samples with relatively
short individual runtimes, or your cluster limits the number of jobs,
you can set this to 1 or more to run each sample in only one job,
thus reducing the job management overhead. The number given is the
number of cores assigned to each such job.
- -keepfields fieldlist
- Besides the obligatory fields, include only the fields in
fieldlist (space separated) in the multicompar file. Default is to
use all fields present in the file (*). All fields will still be
used in the per sample output.
This command can be distributed on a cluster or using multiple
cores with job options (more info with
cg help joboptions). The option -distrreg can be used to allow a
greater distribution by doing some analyses (variant calling,
annotations) split by region (chromosomes) and combining the results.
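To make the effect of a numeric -distrreg value concrete, the following sketch splits chromosomes into consecutive regions of a given size, each of which could then be analysed as a separate job. It is an illustration only and not the genomecomb implementation: the function split_regions, the toy chromosome lengths and the zero-based half-open coordinates are all assumptions.

```python
def split_regions(chromosome_sizes, regionsize):
    """Return (chromosome, begin, end) regions of at most regionsize bases.

    Coordinates are zero-based and half-open (an assumption made for
    this sketch); the last region of a chromosome may be shorter.
    """
    regions = []
    for chromosome, size in chromosome_sizes.items():
        for begin in range(0, size, regionsize):
            regions.append((chromosome, begin, min(begin + regionsize, size)))
    return regions

# Toy chromosome lengths, made up for the demonstration
sizes = {"chr1": 2500, "chr2": 1000}
for region in split_regions(sizes, 1000):
    print(region)
```

The s, r and g variants described under -distrreg differ from this simple scheme in that they only allow breaks in specific kinds of regions, so their region sizes only approximate the requested size.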
Sample specific options
Different options can be given to different samples within the
same experiment run by storing a file named options.tsv in the
experiment/project dir with the following fields: sample, option, value.
For each sample (that differs from the general option if given)
you add a line with the sample name, the option (without the -, e.g.
sc_expectedcells) and the value (the number of expected cells in the
case of sc_expectedcells). Sample specific options given this way
overrule the general options given on the process_project commandline
(for that sample).
You can also use the preset option this way, allowing the analysis
of different technologies (e.g. ont and srs) in one run. Beware that
presets only change the base/default settings: options explicitly given
on the process_project commandline will overrule the settings from a
preset, even a sample specific one.
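As an illustration of the file described above, the sketch below writes an options.tsv with the three fields sample, option and value. It assumes the usual tab-separated layout with a header line; the helper write_sample_options, the sample names and the sc_expectedcells value are made up for the example.

```python
import csv
import tempfile
from pathlib import Path

def write_sample_options(projectdir, rows):
    """Write (sample, option, value) rows to projectdir/options.tsv.

    Hypothetical helper; option names are given without the leading -.
    """
    path = Path(projectdir) / "options.tsv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "option", "value"])  # field names from the text
        writer.writerows(rows)
    return path

rows = [
    ("sample1", "preset", "ont"),             # long read sample
    ("sample2", "preset", "srs"),             # short read sample
    ("sample3", "sc_expectedcells", "5000"),  # made-up value
]
# Demo in a temporary directory standing in for the project directory
with tempfile.TemporaryDirectory() as projectdir:
    print(write_sample_options(projectdir, rows).read_text())
```

Only samples that deviate from the general settings need a line in this file; all other samples use the options given on the commandline.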
Dependencies
Some of the programs used in this workflow are not distributed
with genomecomb itself (e.g. gatk, strelka) and should be installed
separately. To make this easier, they are available as portable
application directories from the
[genomecomb website](https://derijkp.github.io/genomecomb/install.html)
or can be directly installed using the cg install command.
Example
cg process_project -d sge testproject /complgen/refseq/hg19
Category
Process