Process

Format

cg process_sample ?options? ?oridir? sampledir

Summary

Processes one sample directory (projectdir), generating full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data that can come from various sources.

Description

The command expects a basic genomecomb sample directory (as described extensively in projectdir) It performs a generic ngs analysis pipeline where options determine how the data is prepared, what aligner is used, what programs are used to analyse the alignments (variant calling, counting, ...) and which reports are created. With the -preset option, you can select a number of default settings for short and long read genome analysis as well as RNA-seq analysis.

Several types of source data (fastq files, unaligned bams, Complete Genomics analysis dir, ...) are supported. The directory containing the original starting data can be given as an option, an argument, or it can be present in a dir named ori in the sample dir. If given as an argument, a link named ori to the original data will be made. The results (which analyses, etc.) differ according to the type of original data, the parameters given (e.g. use -amplicons for amplicon sequencing) and files in the sampledir.

By default, process_sample will only create files that do no exist yet, or update ones that are older than files they depend on. This way an analysis that was interupted, can be simply restarted (giving the same command), and it will proceed from where it was. New options can also be given/added, e.g. add an extra variant caller, and only the new analyses will be done.

Arguments

oridir: directory containing original data, this can be a data directory as it comes from Complete Genomics, or simply a directory containing fastq files or a bam file. This argument is optional; if not given, a directory named ori containing the source data is expected in the sampledir (This can be a softlink), or a fastq dir with fastq files.
sampledir: name of the sample directory to be created (or completed if it already exists)

Options

As different types of original data are processed differently, not all options are allways applicable. Options that are not applicable to the given type of data are ignored.

By default options are set for short read genomic sequencing (genome/exome/targeted). Presets can be used to set a number of options to the defaults for a given analysis. These options can still be changed by specifically giving them a value after the -preset option (options given later overrule previous ones).

-preset preset: sets a number of options to the defaults for the given "preset", must be one of: srs (short read genomic sequencing), rseq (short read rna-seq), ont (ont genomic sequencing), ontr (ont RNA-seq), scywalker (ont 10x single cell rna-seq), scywalker_pacbio (pacbio 10x single cell rna-seq)
-dbdir dbdir: Some of the analysis require a reference genome and databases; dbdir gives the directory where to find these
-oridir oridir: directory containing original data, this can be a data directory as it comes from Complete Genomics, or simply a directory containing fastq files or a bam file. A softlink to oridir named ori will be made in the sample directory.
-minfastqreads number: if less then number reads are found in the fastq files of the sample, the sample is not processed.
-clip: clip adaptor sequences prior to alignment using fastq-mcf (default 1)
-paired 1/0 (-p): sequenced are paired/unpaired
-adapterfile file: Use file for possible adapter sequences
-removeskew num: -k parameter for sequence clipping using fastq-mcf: sKew percentage-less-than causing cycle removal
-aligners aligner (-a): use the given aligner for mapping to the reference genome (default bwa) Currently supported are: bwa, bowtie2, minimap2_sr, minimap2, ngmlr ; for rna-seq: star, star_2p, hisat2, minimap2_splice
-ali_keepcomments 1/0: set to 1 to transfer sequence comments in the source fastq or ubams to the alignment (default don't keep for fastq, keep for ubams). This option currently only works for minimap2 aligner
-aliformat format: format of the (final) alignment (map) file, this is by default bam, but can be set to cram
-realign value: If value is 0, realignment will not be performed, use 1 for (default) realignment with gatk, or value srma for alignment with srma
-removeduplicates 0/1/picard/biobambam: By default duplicates will be removed (marked actually) using samtools except for amplicon sequencing. With this option you can specifically request or turn of duplicate removal (overruling the default). If you want to use large amounts of memory ;-), you can still use picard for removing duplicates (third option)
-amplicons ampliconfile: This option turns on amplicon sequencing analysis (see further) using the amplicons defained in ampliconfile
-varcallers varcallers: (space separated) list of variant callers to be used (default "gatkh strelka"). Currently supported are: gatk, gatkh (haplotype caller), strelka, sam, freebayes, bcf, longshot, clair3,
-svcallers svcallers: (space separated) list of structural variant callers to be used (default empty). Currently supported are: manta, lumpy, gridds sniffles, cuteSV, npinv
-methcallers methcallers: (space separated) list of methylation callers to be used (default empty). Currently supported are: nanopolish remora (actual remora analysis must have been done before, with methylation data present in the source unaligned bams)
-reftranscripts reftranscripts: file with reference transcripts for isoform calling. (default empty -> finds default in refdb) Currently supported are: flair, isoquant, flames
-isocallers isocallers: (space separated) list of isofrom calling (and counting) programs to be used for rna-seq data. Currently supported are: flair, isoquant, flames
-organelles organelles: (space separated) list of chromosomes that are organelles (that are treated differently in some analysis) If not given explicitely, the ones indicated in the file $refdb/extra/reg_*_organelles.tsv (if present) will be used
-counters counters: (space separated) list of counter programs to be used for rna-seq data. Currently supported are: rnaseqc, qorts
-split 1/0 (-s): split multiple alternative genotypes over different line
-downsampling_type NONE/ALL_READS/BY_SAMPLE/: sets the downsampling type used by GATK (empty for default).
-reports list: use basic (default) for creating most reports (or all for all reports). If you only want some made, give these as a space separated list. Possible reports are: fastqstats fastqc flagstat_reads flagstat_alignments samstats alignedsamstats unalignedsamstats histodepth vars hsmetrics covered histo predictgender
-m maxopenfiles (-maxopenfiles): The number of files that a program can keep open at the same time is limited. pmulticompar will distribute the subtasks thus, that the number of files open at the same time stays below this number. With this option, the maximum number of open files can be set manually (if the program e.g. does not deduce the proper limit, or you want to affect the distribution).
-samBQ number: only for samtools; minimum base quality for a base to be considered (samtools --min-BQ option)
-distrreg regions: distribute regions for parallel processing. Possible options are 0: no distribution (also empty) 1: default distribution schr or schromosome: each chromosome processed separately chr or chromosome: each chromosome processed separately, except the unsorted, etc. with a _ in the name that will be combined), a number: distribution into regions of this size a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches) a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions a number preceded by an g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes g: distribution into regions as given in the <refdir>/extra/reg_*_distrg.tsv file; if this does not exist uses g5000000 a file name: the regions in the file will be used for distribution
-maxfastqdistr maxfastqdistr: if there are more than maxfastqdistr separate input fastqs, they will be merged into maxfastqdistr fastqs for analysis: If there are many (small) fastqs, the overhead to processes (alignment etc.) them separately (default, to distribute the load) can become too large.
-datatype datatype: Some variant callers (strelka) need to know the type of data (genome, exome or amplicons) for analysis. You can specify it using this option. If not given, it is deduced from acompanying region files (reg_*_amplicons.tsv for ampicons or reg_*_amplicons.tsv for exome)
-hap_bam 0/1: if 1 produce a bam/cram file with haplotype indications (longshot only) (default 0)
-depth_histo_max number: in reports, count positions with up to number depth (default 1000). Larger dfepths will be counted under number
-targetfile targetfile: if targetfile is provided, coverage statistics will be calculated for this region

Following options are available for singlecell or bulk UMI corrected analysis:

-singlecell o/1/ontr10x: set to 1 or ontr10x for (scywalker) long read single cell analysis, only supported for isoquant isoform caller; -sc_expectedcells number: gives the the number of cells expected.; -cellmarkerfile file: A tsv file providing genes that are indicative specific cell cell types.
-tissue string: The tissue type of the sample. If cellmarkerfile is given, only markers of the given tissue are used.
-sc_whitelist file: (for singlecell) file with all possible correct barcodes. By default the whitelist with the 10x version 3 barcodes is used. You can specify to use version 2 of the whitelist by using the shortcut v2
-sc_adaptorseq string: adapter sequence used to find the barcode and UMI. These should follow the given sequence in the reads (default is CTACACGACGCTCTTCCGATCT for singlecell)
-sc_barcodesize number: size of the barcode (default 16 for singlecell)
-sc_umisize number: size of the UMI (default 12 for singlecell)
-addumis 0/1: when 1, add barcode and UMI info to reads for bulk sequencing, in (bulk) isoquant analysis umicounts will be added in the results (barcodes are not used) You can use -sc_adaptorseq, -sc_barcodesize and -sc_umisize to change how barcodes and UMIs are searched (default for -addumis is GATCGGAAGAGCACACGTCTGAACTCCAGTCAC,8,12)

This command can be distributed on a cluster or using multiple cores with job options (more info with cg help joboptions) The option -distrreg can be used to allow a greater distribution by doing some analyses (variantcalling, annotations) split by region (chromosomes) and combinfing the results

Sample types

Several types of sample data are supported:

(targeted) shotgun sequencing

For Illumina sequencing the starting raw data for the sample is fastq files. These should be in a subdirectory of the sampledir named fastq. They can also be in a directory ori in the sampledir, in which case the fastq dir will be made and links to the fastq files made in it.

The names of matching fastq files of paired reads should be consecutive when sorted naturaly,the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok. The name of each sample is taken from the sampledir name. The sample name should not contain hyphens (-)

By default reads are clipped using fastq-mcf, aligned to the reference genome in dbdir using bwa mem, duplicates removed (using picard) and realigned (using gatk). Variants are called using gatk and samtools. All files generated have names following the convention of using hyphens to separate different elements about the file. The first element is the type of file. The last element (before the extension) is the sample name. There can be several steps in between. Each sampledir will contain results for this individual sample of the following type:

map-rdsbwa-sample1.bam: bam file created by aligning the reads of sample1 to the reference genome in dbdir using bwa. The bam file has been sorted (s), duplicate marked (d), and realigned (r).

var-gatk-rdsbwa-sample1.tsv: a variant file that contains variants called by gatk based on map-rdsbwa-sample1.bam. Positions with a quality < 30 or coverage < 5 are considered unsequenced. Lower quality variants (but with quality >= 10) are still included in the variant list, but have the a "u" in the sequenced and zyg columns to indicate that they are considered unsequenced
sreg-gatk-rdsbwa-sample1.tsv: A region file with all regions that can be considered sequenced (quality >= 30 and coverage >= 5) using the same methods and quality measures as var-gatk-rdsbwa-sample1.tsv. Any position in those regions that is not in the variant file can be called reference with the same reliability as the variant calls.
varall-gatk-rdsbwa-sample1.tsv: variant calling data by gatk for all positions with >= 5 coverage (also reference called positions). This file is used to create the sreg files, and to update data in making multicompar files later.
reg_cluster-gatk-rdsbwa-S0489.tsv: regions with many clustered variants (which are less reliable)

For strelka variant calling on the same bamfile (map-rdsbwa-sample1.bam), these result files are named var-strelka-rdsbwa-sample1.tsv, sreg-strelka-rdsbwa-sample1.tsv, varall-strelka-rdsbwa-sample1.tsv, reg_cluster-strelka-rdsbwa-S0489.tsv

If the experiment used e.g. exome capture, this can be indicated by the presence of a file named reg_targets.tsv (or matching reg_*_targets*.tsv) in the sampledir (or the option -targetfile). If present, coverage statistics will be calculated for this region.

ONT/nanopore sequencing

For ONT genome/targeted sequencing analysis similar steps are performed, but using the ont preset other programs and settings are selected, e.g. minimap2 for alignment, clair3 for (small) variant calling. You can still deviate from the preset by specifying different options for e.g. variant calling after the preset option. The default ONT pipeline also expects a fastq directory with the fastqs in it per sample, but can also use unaligned bam files (in a direcory named ubam). You would typically use ubams instead of fastq to include methylation calling data produced by remora/dorado at the basecalling step.

Amplicon sequencing

Amplicon sequencing samples are indicated by the presence of a file named reg_amplicons.tsv (or matching reg_*_amplicons*.tsv) in the sampledir. If the option -amplicons is given to the command, a link to the given ampliconfile will be created in the sampledir and used. If an ampliconfile (or link) already exists in the sampledir, it will NOT be overwritten! (only a warning given).

An amplicon file is a tsv file indicating the genomic location of the amplicons It must have the following fields:

chromosome: chromosome of amplicon
begin: start of sequenceable part of amplicon: i.e. at the end of the forward primer
end: end of sequenceable part of amplicon: i.e. before the reverse primer sequence in the genome refernce
outer_begin: start of amplicon including primers, i.e. start of forward primer in the genome
outer_end: end of amplicon including primers, i.e. end of reverse primer sequence in the genome

Amplicon sequencing samples can also start from the fastq files and are processed similarly to shotgun sequencing, but analysis is different in a few ways:

Variants will only be called in the sequenceable part of the amplicons (i.e. between begin and end). (off-target mappings are not called) To avoid wrong results by "sequencing" primers, the primer parts of amplicons will be clipped based on their mapping on the expected amplicons in the bam file. (This is done by replacing the sequence by Ns and reducing quality to 0 for these positions)

Several options use different defaults when amplicon sequencing is specified (-removeduplicates 0 -removeskew 0 -dt NONE).

Complete Genomics sequencing

Complete Genomics source data is already aligned and variant called. The region and variant information is converted to a similar format as use for (Illumina) shotgun sequencing, with some differences:

Naming uses a cg-cg- prefix: var-cg-cg-sample1.tsv, sreg-cg-cg-sample1.tsv, reg_cluster-cg-cg-S0489.tsv, ...
Some files are not present (e.g. no varall)
Extra files, e.g. the directory coverage-cg-sample contains whole genome coverage, refscore, ... data in the form of bcol files.
CG data can include structural variant (cgsv-sample.tsv) and CNV (cgcnv-sample.tsv) calls

RNA-seq (rseq)

Similar to genome analysis, the source data is either fastqs in a directory fastq or unaligned bams in a ubam directory. In addition to variant calls, the short read preset (rseq) will also produce a gene count matrix using QoRTs (changable to e.g. rnaseqc for human) and the ONT preset (ontr) will add isoform predictions and counts to this.

Precalculated data

The sampledir may contain precalculated data data from other pipelines. If these are in the correct format, they will be integrated in the project. vcf files (var-*.vcf) in the ori subdirectory of the sample will be converted to tsv files, and their variants included in the multicompar.

Home

Contact

Installation

Documentation