genome project directory
Although most genomecomb commands can be individually run on files
located anywhere, some of the commands expect or generate data
organised in a specific project structure: A genomecomb project
directory or projectdir for short is a directory containing
(links to) raw data and analysis in this particular structure, which
is described in this help.
The cg proccess_project
command can e.g. be used to generate a projectdir with full analysis
information (variant calls, multicompar, reports, ...) starting from
raw sample data from various sources.
overview
A projectdir basically contains individual sample directories (in
the subdirectory samples) and overview data (samples comparisons) in
the following structure:
- samples
- directory containing a separate sample directory for each
sample
- compar
- directory containing files that combine and compare data from
all samples (multicompar files)
- projectinfo.tsv
- a file with some meta data about the analysis/data in the
projectdir As long as compatible analysis (same reference genome,
split or unsplit variants) were used, the same sampledir can be
used in multiple projects (e.g. using a soft link).
The project name is taken from the filename of the
projectdir. The project name will be used (a.o.) in naming most of
the overview files. These filenames will end with a hyphen-minus
character followed by the projectsname. For this reason, the
hyphen-minus character may not be used in the projectname.
Most of the result files in a projectdir/sampledir are
tab-separated value files (file extension tsv) of various types
(described in format_tsv). For space
reasons, files are often compressed. genomecomb tools can generally
handle compressed files transparently.
sample directory
Indivual sample data is in subdirectories of the samples directory
in the projectdir. Each of these sampledirs contains the raw data and
analysed data from one sample. The sample name is taken from
the filename of the sampledir. As hyphen-minus characters are used in
naming the analysis results files ending with the sample name, this
character (-) should not be present in the name.
sample source data
- ori
- A sampledir can contain a (link to a) directory containing
the original sequnencing data, named ori. The commands cg process_sample or cg process_project can be used
to analyse the data and produce a fully filled sampledir/projectdir
- fastq
- If the original data is in the form of fastq files, the fastq
files for that sample are present in a subdirectory named
fastq. (If fastq files are found in the ori directory, a
fastq dir is made, and the files linked.) Any of the commonly used
file name extensions (.fastq.gz, .fq.gz, .fastq, .fq) are
recognised The names of matching fastq files of paired reads should
be consecutive when sorted naturaly,the forward reads first. The
usual naming of these files (same name, except for a 1 and 2) is
ok.
- ubam
- sequencing data as ubams (unaligned bams) is also accepted.
These should be typically in a directory named ubam instead of
fastq, although they will be detected in a fastq directory as well.
If both a ubam and fastq directory are present, the ubam gets
priority.
sample results
All files generated have names following the convention of using
hyphen-minus to separate different elements of the file. The first
element indicates what is in the file. The last element (before the
extension) is the sample name. There can be several steps in between.
Each sampledir can contain results for this individual sample of
the following type (depending on source data):
- map-rdsbwa-sample1.bam
- bam file created by aligning the reads of sample1 to the
reference genome in refdir using bwa. The bam file has been sorted
(s), duplicate marked (d), and realigned (r).
- var-gatk-rdsbwa-sample1.tsv.zst
- a (compressed) tsv variant file
that contains variants called by gatk based on
map-rdsbwa-sample1.bam.
- sreg-gatk-rdsbwa-sample1.tsv.zst
- A region file with all regions that can be considered
sequenced using the same methods and quality measures as
var-gatk-rdsbwa-sample1.tsv.zst. Any position in those regions that
is not in the variant file can be called reference with the same
reliability as the variant calls.
- varall-gatk-rdsbwa-sample1.tsv.zst
- variant file containing variant calls by gatk for all
positions with >= 5 coverage (also reference called positions).
This file is used to create the sreg files, and to update data in
making multicompar files later.
- reg_cluster-gatk-rdsbwa-S0489.tsv.zst
- regions with many clustered variants (which are less
reliable)
- bcolall
- directory containing whole genome coverage, refscore, ...
data in the form of bcol files. These files
can be used to create the sreg files, and to update data in making
multicompar files later. (In older project dirs, this directory may
be called coverage-cg-* and contain old style formatted bcol files)
- sv-manta-rdsbwa-sample.tsv.zst
- structural variant calls by manta
- cgsv-sample.tsv.zst
- Complete Genomics structural variants
- cgcnv-sample.tsv.zst
- Complete Genomics CNV data
The result files from samtools variant calling on the same
bamfile (map-rdsbwa-sample1.bam), are named
var-sam-rdsbwa-sample1.tsv.zst, sreg-sam-rdsbwa-sample1.tsv.zst,
varall-sam-rdsbwa-sample1.tsv.zst,
reg_cluster-sam-rdsbwa-S0489.tsv.zst
For Complete Genomics alignment and variant calling the files are
named var-cg-cg-sample1.tsv.zst, sreg-cg-cg-sample1.tsv.zst,
reg_cluster-cg-cg-S0489.tsv.zst
The sampledir may contain precalculated data data from other
pipelines. If these are in the correct format, they will be
integrated in the project. vcf files (var-*.vcf) will be converted to
tsv files, and their variants included in the multicompar.
compar dir
The subdirectory compar contains comparisons of all samples, e.g.:
- annot_compar-projectname.tsv.zst
- multicompar file containing information for all variants in
all samples (and all methods). If a variant is not present in one
of the samples, the information at the position of the variant will
be completed (is the position sequenced or not, coverage, ...) The
file is also annotated with all databases in refdir (impact on
genes, regions of interest, known variant data)
- sreg-projectname.tsv.zst
- sequenced region multicompar file containing for all regions
whether they are sequenced (1) or nor (0) for each sample.
- annot_sv-projectname.tsv.zst
- multicompar structural variant file containing information
for all structural variants in all samples (and all methods). This
file is made differently from the small vrariants file: Structural
variant comparison uses approximate matching: Inversion and
deletions are matched if they overlap at least 75%, and the begin
and end positions differ less than 300 bases. For insertions and
translocation, a difference of 30 bases in position is allowed (by
default). Also, for structural variants information will not be
completed (is the position sequenced or not, coverage, ...) for
samples without a variant call. The file is also annotated with all
databases in refdir (impact on genes, regions of interest, known
variant data)
analysisinfo files
Most files have an accompanying analysisinfo file (same name as
the file, but with the extension .analysisinfo added). These are tsv
files containing information about how the file was made (which
programs were used, which versions, settings, ...)
projectinfo.tsv
projectinfo.tsv is a tsv file containing
data about the project. It must have 2 columns: key and value. The
following keys can be found:
- refdir
- directory containing reference data (genome sequence,
annotation, ...). projectinfo.tsv file.
- split
- if 1, each alternative allele is on a separate line. If 0,
multiple alternative alleles in the sample location and allele
specific data are on one line, the relevant fields containing
(comma separated) lists.