Process
Format
cg process_sample ?options? ?oridir? sampledir
Summary
Processes one sample directory (projectdir), generating full analysis
information (variant calls, multicompar, reports, ...) starting from
raw sample data that can come from various sources.
Description
The command expects a basic genomecomb sample directory (as
described extensively in projectdir) It
performs a generic ngs analysis pipeline where options determine how
the data is prepared, what aligner is used, what programs are used to
analyse the alignments (variant calling, counting, ...) and which
reports are created. With the -preset option, you can select a number
of default settings for short and long read genome analysis as well
as RNA-seq analysis.
Several types of source data (fastq files, unaligned bams,
Complete Genomics analysis dir, ...) are supported. The directory
containing the original starting data can be given as an option, an
argument, or it can be present in a dir named ori in the sample dir.
If given as an argument, a link named ori to the original data will
be made. The results (which analyses, etc.) differ according to the
type of original data, the parameters given (e.g. use -amplicons for
amplicon sequencing) and files in the sampledir.
By default, process_sample will only create files that do no exist
yet, or update ones that are older than files they depend on. This
way an analysis that was interupted, can be simply restarted (giving
the same command), and it will proceed from where it was. New options
can also be given/added, e.g. add an extra variant caller, and only
the new analyses will be done.
Arguments
- oridir
- directory containing original data, this can be a data
directory as it comes from Complete Genomics, or simply a directory
containing fastq files or a bam file. This argument is optional; if
not given, a directory named ori containing the source data is
expected in the sampledir (This can be a softlink), or a fastq dir
with fastq files.
- sampledir
- name of the sample directory to be created (or completed if
it already exists)
Options
As different types of original data are processed differently, not
all options are allways applicable. Options that are not applicable
to the given type of data are ignored.
By default options are set for short read genomic sequencing
(genome/exome/targeted). Presets can be used to set a number of
options to the defaults for a given analysis. These options can still
be changed by specifically giving them a value after the -preset
option (options given later overrule previous ones).
- -preset preset
- sets a number of options to the defaults for the given
"preset", must be one of: srs (short read genomic
sequencing), rseq (short read rna-seq), ont (ont genomic
sequencing), ontr (ont RNA-seq), scywalker (ont 10x single cell
rna-seq), scywalker_pacbio (pacbio 10x single cell rna-seq)
- -dbdir dbdir
- Some of the analysis require a reference genome and
databases; dbdir gives the directory where to find these
- -oridir oridir
- directory containing original data, this can be a data
directory as it comes from Complete Genomics, or simply a directory
containing fastq files or a bam file. A softlink to oridir named
ori will be made in the sample directory.
- -minfastqreads number
- if less then number reads are found in the fastq files
of the sample, the sample is not processed.
- -clip
- clip adaptor sequences prior to alignment using fastq-mcf
(default 1)
- -paired 1/0 (-p)
- sequenced are paired/unpaired
- -adapterfile file
- Use file for possible adapter sequences
- -removeskew num
- -k parameter for sequence clipping using fastq-mcf: sKew
percentage-less-than causing cycle removal
- -aligners aligner (-a)
- use the given aligner for mapping to the reference genome
(default bwa) Currently supported are: bwa, bowtie2, minimap2_sr,
minimap2, ngmlr ; for rna-seq: star, star_2p, hisat2,
minimap2_splice
- -ali_keepcomments 1/0
- set to 1 to transfer sequence comments in the source fastq or
ubams to the alignment (default don't keep for fastq, keep for
ubams). This option currently only works for minimap2 aligner
- -aliformat format
- format of the (final) alignment (map) file, this is by
default bam, but can be set to cram
- -realign value
- If value is 0, realignment will not be performed, use
1 for (default) realignment with gatk, or value srma for
alignment with srma
- -removeduplicates 0/1/picard/biobambam
- By default duplicates will be removed (marked actually) using
samtools except for amplicon sequencing. With this option you can
specifically request or turn of duplicate removal (overruling the
default). If you want to use large amounts of memory ;-), you can
still use picard for removing duplicates (third option)
- -amplicons ampliconfile
- This option turns on amplicon sequencing analysis (see
further) using the amplicons defained in ampliconfile
- -varcallers varcallers
- (space separated) list of variant callers to be used (default
"gatkh strelka"). Currently supported are: gatk, gatkh
(haplotype caller), strelka, sam, freebayes, bcf, longshot, clair3,
- -svcallers svcallers
- (space separated) list of structural variant callers to be
used (default empty). Currently supported are: manta, lumpy, gridds
sniffles, cuteSV, npinv
- -methcallers methcallers
- (space separated) list of methylation callers to be used
(default empty). Currently supported are: nanopolish remora (actual
remora analysis must have been done before, with methylation data
present in the source unaligned bams)
- -reftranscripts reftranscripts
- file with reference transcripts for isoform calling. (default
empty -> finds default in refdb) Currently supported are: flair,
isoquant, flames
- -isocallers isocallers
- (space separated) list of isofrom calling (and counting)
programs to be used for rna-seq data. Currently supported are:
flair, isoquant, flames
- -organelles organelles
- (space separated) list of chromosomes that are organelles
(that are treated differently in some analysis) If not given
explicitely, the ones indicated in the file
$refdb/extra/reg_*_organelles.tsv (if present) will be used
- -counters counters
- (space separated) list of counter programs to be used for
rna-seq data. Currently supported are: rnaseqc, qorts
- -split 1/0 (-s)
- split multiple alternative genotypes over different line
- -downsampling_type NONE/ALL_READS/BY_SAMPLE/
- sets the downsampling type used by GATK (empty for default).
- -reports list
- use basic (default) for creating most reports (or all for all
reports). If you only want some made, give these as a space
separated list. Possible reports are: fastqstats fastqc
flagstat_reads flagstat_alignments samstats alignedsamstats
unalignedsamstats histodepth vars hsmetrics covered histo
predictgender
- -m maxopenfiles (-maxopenfiles)
- The number of files that a program can keep open at the same
time is limited. pmulticompar will distribute the subtasks thus,
that the number of files open at the same time stays below this
number. With this option, the maximum number of open files can be
set manually (if the program e.g. does not deduce the proper limit,
or you want to affect the distribution).
- -samBQ number
- only for samtools; minimum base quality for a base to be
considered (samtools --min-BQ option)
- -distrreg regions
- distribute regions for parallel processing. Possible options
are 0: no distribution (also empty) 1: default
distribution schr or schromosome: each chromosome processed
separately chr or chromosome: each chromosome processed
separately, except the unsorted, etc. with a _ in the name that
will be combined), a number: distribution into regions of this
size a number preceded by an s: distribution into regions
targeting the given size, but breaks can only occur in unsequenced
regions of the genome (N stretches) a number preceded by an r:
distribution into regions targeting the given size, but breaks can
only occur in large (>=100000 bases) repeat regions a
number preceded by an g: distribution into regions targeting the
given size, but breaks can only occur in large (>=200000 bases)
regions without known genes a file name: the regions in the
file will be used for distribution
- -maxfastqdistr maxfastqdistr
- if there are more than maxfastqdistr separate input
fastqs, they will be merged into maxfastqdistr fastqs for
analysis: If there are many (small) fastqs, the overhead to
processes (alignment etc.) them separately (default, to distribute
the load) can become too large.
- -datatype datatype
- Some variant callers (strelka) need to know the type of data
(genome, exome or amplicons) for analysis. You can specify it using
this option. If not given, it is deduced from acompanying region
files (reg_*_amplicons.tsv for ampicons or reg_*_amplicons.tsv for
exome)
- -hap_bam 0/1
- if 1 produce a bam file with haplotype indictions (longshot
only) (default 0)
- -depth_histo_max number
- in reports, count positions with up to number depth
(default 1000). Larger dfepths will be counted under number
- -targetfile targetfile
- if targetfile is provided, coverage statistics will be
calculated for this region
This command can be distributed on a cluster or using multiple
cores with job options (more info with
cg help joboptions) The option -distrreg can be used to allow a
greater distribution by doing some analyses (variantcalling,
annotations) split by region (chromosomes) and combinfing the results
Sample types
Several types of sample data are supported:
(targeted) shotgun sequencing
For Illumina sequencing the starting raw data for the sample is
fastq files. These should be in a subdirectory of the sampledir named
fastq. They can also be in a directory ori in the sampledir, in which
case the fastq dir will be made and links to the fastq files made in
it.
The names of matching fastq files of paired reads should be
consecutive when sorted naturaly,the forward reads first. The usual
naming of these files (same name, except for a 1 and 2) is ok. The
name of each sample is taken from the sampledir name. The sample name
should not contain hyphens (-)
By default reads are clipped using fastq-mcf, aligned to the
reference genome in dbdir using bwa mem, duplicates removed (using
picard) and realigned (using gatk). Variants are called using gatk
and samtools. All files generated have names following the convention
of using hyphens to separate different elements about the file. The
first element is the type of file. The last element (before the
extension) is the sample name. There can be several steps in between.
Each sampledir will contain results for this individual sample of the
following type:
- map-rdsbwa-sample1.bam
- bam file created by aligning the reads of sample1 to the
reference genome in dbdir using bwa. The bam file has been sorted
(s), duplicate marked (d), and realigned (r).
- var-gatk-rdsbwa-sample1.tsv
- a variant file that contains
variants called by gatk based on map-rdsbwa-sample1.bam. Positions
with a quality < 30 or coverage < 5 are considered
unsequenced. Lower quality variants (but with quality >= 10) are
still included in the variant list, but have the a "u" in
the sequenced and zyg columns to indicate that they are considered
unsequenced
- sreg-gatk-rdsbwa-sample1.tsv
- A region file with all regions that can be considered
sequenced (quality >= 30 and coverage >= 5) using the same
methods and quality measures as var-gatk-rdsbwa-sample1.tsv. Any
position in those regions that is not in the variant file can be
called reference with the same reliability as the variant calls.
- varall-gatk-rdsbwa-sample1.tsv
- variant calling data by gatk for all positions with >= 5
coverage (also reference called positions). This file is used to
create the sreg files, and to update data in making multicompar
files later.
- reg_cluster-gatk-rdsbwa-S0489.tsv
- regions with many clustered variants (which are less
reliable)
For strelka variant calling on the same bamfile
(map-rdsbwa-sample1.bam), these result files are named
var-strelka-rdsbwa-sample1.tsv, sreg-strelka-rdsbwa-sample1.tsv,
varall-strelka-rdsbwa-sample1.tsv,
reg_cluster-strelka-rdsbwa-S0489.tsv
If the experiment used e.g. exome capture, this can be indicated
by the presence of a file named reg_targets.tsv (or matching
reg_*_targets*.tsv) in the sampledir (or the option -targetfile). If
present, coverage statistics will be calculated for this region.
ONT/nanopore sequencing
For ONT genome/targeted sequencing analysis similar steps are
performed, but using the ont preset other programs and settings are
selected, e.g. minimap2 for alignment, clair3 for (small) variant
calling. You can still deviate from the preset by specifying
different options for e.g. variant calling after the preset option.
The default ONT pipeline also expects a fastq directory with the
fastqs in it per sample, but can also use unaligned bam files (in a
direcory named ubam). You would typically use ubams instead of fastq
to include methylation calling data produced by remora/dorado at the
basecalling step.
Amplicon sequencing
Amplicon sequencing samples are indicated by the presence of a
file named reg_amplicons.tsv (or matching reg_*_amplicons*.tsv) in
the sampledir. If the option -amplicons is given to the command, a
link to the given ampliconfile will be created in the sampledir and
used. If an ampliconfile (or link) already exists in the sampledir,
it will NOT be overwritten! (only a warning given).
An amplicon file is a tsv file indicating
the genomic location of the amplicons It must have the following
fields:
- chromosome
- chromosome of amplicon
- begin
- start of sequenceable part of amplicon: i.e. at the end of
the forward primer
- end
- end of sequenceable part of amplicon: i.e. before the reverse
primer sequence in the genome refernce
- outer_begin
- start of amplicon including primers, i.e. start of forward
primer in the genome
- outer_end
- end of amplicon including primers, i.e. end of reverse primer
sequence in the genome
Amplicon sequencing samples can also start from the fastq files
and are processed similarly to shotgun sequencing, but analysis is
different in a few ways:
Variants will only be called in the sequenceable part of the
amplicons (i.e. between begin and end). (off-target mappings are not
called) To avoid wrong results by "sequencing" primers, the
primer parts of amplicons will be clipped based on their mapping on
the expected amplicons in the bam file. (This is done by replacing
the sequence by Ns and reducing quality to 0 for these positions)
Several options use different defaults when amplicon sequencing is
specified (-removeduplicates 0 -removeskew 0 -dt NONE).
Complete Genomics sequencing
Complete Genomics source data is already aligned and variant
called. The region and variant information is converted to a similar
format as use for (Illumina) shotgun sequencing, with some
differences:
- Naming uses a cg-cg- prefix: var-cg-cg-sample1.tsv,
sreg-cg-cg-sample1.tsv, reg_cluster-cg-cg-S0489.tsv, ...
- Some files are not present (e.g. no varall)
- Extra files, e.g. the directory coverage-cg-sample contains
whole genome coverage, refscore, ... data in the form of bcol files.
- CG data can include structural variant (cgsv-sample.tsv) and
CNV (cgcnv-sample.tsv) calls
RNA-seq (rseq)
Similar to genome analysis, the source data is either fastqs in a
directory fastq or unaligned bams in a ubam directory. In addition to
variant calls, the short read preset (rseq) will also produce a gene
count matrix using QoRTs (changable to e.g. rnaseqc for human) and
the ONT preset (ontr) will add isoform predictions and counts to
this.
Precalculated data
The sampledir may contain precalculated data data from other
pipelines. If these are in the correct format, they will be
integrated in the project. vcf files (var-*.vcf) in the ori
subdirectory of the sample will be converted to tsv files, and their
variants included in the multicompar.
Category
Process