GenomeComb

genome project directory

Although most genomecomb commands can be run individually on files located anywhere, some commands expect or generate data organised in a specific project structure. A genomecomb project directory (projectdir for short) is a directory containing (links to) raw data and analyses in this particular structure, which is described in this help.

The cg process_project command can, for example, be used to generate a projectdir with full analysis information (variant calls, multicompar, reports, ...) starting from raw sample data from various sources.

overview

A projectdir basically contains individual sample directories (in the subdirectory samples) and overview data (sample comparisons) in the following structure:

samples
directory containing a separate sample directory for each sample
compar
directory containing files that combine and compare data from all samples (multicompar files)
projectinfo.tsv
a file with some metadata about the analysis/data in the projectdir

As long as compatible analyses (same reference genome, split or unsplit variants) were used, the same sampledir can be used in multiple projects (e.g. using a soft link).

The project name is taken from the filename of the projectdir. The project name is used (among other things) in naming most of the overview files: these filenames end with a hyphen-minus character followed by the project name. For this reason, the hyphen-minus character may not be used in the project name.
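The naming rule above can be illustrated with a minimal Python sketch. It is not part of genomecomb; the function names project_name and overview_filename, and the example paths, are hypothetical and only mirror the rule as described here.

```python
from pathlib import Path


def project_name(projectdir):
    """Return the project name: the final path component of the projectdir.

    Raises ValueError if the name contains a hyphen-minus, since that
    character separates the elements of overview file names.
    """
    name = Path(projectdir).name
    if "-" in name:
        raise ValueError(f"project name {name!r} contains a hyphen-minus")
    return name


def overview_filename(prefix, projectdir, ext="tsv.zst"):
    """Build an overview file name ending in -<project name>.<ext>."""
    return f"{prefix}-{project_name(projectdir)}.{ext}"


# Hypothetical example path
print(overview_filename("annot_compar", "/data/projects/exomes2024"))
```

With a projectdir named exomes2024, the multicompar file would thus be named annot_compar-exomes2024.tsv.zst.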

Most of the result files in a projectdir/sampledir are tab-separated value files (file extension tsv) of various types (described in format_tsv). For space reasons, files are often compressed. genomecomb tools can generally handle compressed files transparently.

sample directory

Individual sample data is in subdirectories of the samples directory in the projectdir. Each of these sampledirs contains the raw data and analysed data from one sample. The sample name is taken from the filename of the sampledir. As hyphen-minus characters separate the elements of the analysis result file names, which end with the sample name, this character (-) should not be present in the sample name.

sample source data

ori
A sampledir can contain a (link to a) directory containing the original sequencing data, named ori. The commands cg process_sample or cg process_project can be used to analyse the data and produce a fully filled sampledir/projectdir.
fastq
If the original data is in the form of fastq files, the fastq files for that sample are present in a subdirectory named fastq. (If fastq files are found in the ori directory, a fastq dir is made, and the files linked.) Any of the commonly used file name extensions (.fastq.gz, .fq.gz, .fastq, .fq) are recognised. The names of matching fastq files of paired reads should be consecutive when sorted naturally, the forward reads first. The usual naming of these files (same name, except for a 1 and 2) is ok.
ubam
Sequencing data as ubams (unaligned bams) is also accepted. These should typically be in a directory named ubam instead of fastq, although they will be detected in a fastq directory as well. If both a ubam and a fastq directory are present, the ubam directory gets priority.
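As an illustration of the fastq pairing convention, the following Python sketch sorts file names naturally and pairs consecutive files. It is an approximation under stated assumptions, not genomecomb's own implementation: the helper names natural_key and pair_fastqs are hypothetical, and the natural sort used here (digit runs compared as numbers) may differ in detail from the one genomecomb applies.

```python
import re

# Extensions listed in the text above
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def natural_key(name):
    """Sort key that compares digit runs numerically ("L10" after "L2")."""
    # re.split with a capture group alternates text and digit runs,
    # so strings and ints always line up at the same positions.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def pair_fastqs(filenames):
    """Return (forward, reverse) pairs from naturally sorted fastq names."""
    fastqs = sorted(
        (f for f in filenames if f.endswith(FASTQ_EXTENSIONS)),
        key=natural_key,
    )
    if len(fastqs) % 2:
        raise ValueError("odd number of fastq files; cannot pair them")
    # Consecutive files form a pair, forward reads first
    return list(zip(fastqs[0::2], fastqs[1::2]))


# Hypothetical file names
files = ["s_L2_R2.fastq.gz", "s_L1_R1.fastq.gz",
         "s_L2_R1.fastq.gz", "s_L1_R2.fastq.gz", "notes.txt"]
for forward, reverse in pair_fastqs(files):
    print(forward, reverse)
```

Files that differ only in a 1 and a 2 end up next to each other after sorting, which is why the usual naming works.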

sample results

All files generated have names following the convention of using hyphen-minus to separate different elements of the file. The first element indicates what is in the file. The last element (before the extension) is the sample name. There can be several steps in between.

Each sampledir can contain results for this individual sample of the following type (depending on source data):

map-rdsbwa-sample1.bam
bam file created by aligning the reads of sample1 to the reference genome in refdir using bwa. The bam file has been sorted (s), duplicate marked (d), and realigned (r).
var-gatk-rdsbwa-sample1.tsv.zst
a (compressed) tsv variant file that contains variants called by gatk based on map-rdsbwa-sample1.bam.
sreg-gatk-rdsbwa-sample1.tsv.zst
A region file with all regions that can be considered sequenced using the same methods and quality measures as var-gatk-rdsbwa-sample1.tsv.zst. Any position in those regions that is not in the variant file can be called reference with the same reliability as the variant calls.
varall-gatk-rdsbwa-sample1.tsv.zst
variant file containing variant calls by gatk for all positions with >= 5 coverage (also reference called positions). This file is used to create the sreg files, and to update data in making multicompar files later.
reg_cluster-gatk-rdsbwa-S0489.tsv.zst
regions with many clustered variants (which are less reliable)
bcolall
directory containing whole genome coverage, refscore, ... data in the form of bcol files. These files can be used to create the sreg files, and to update data in making multicompar files later. (In older project dirs, this directory may be called coverage-cg-* and contain old style formatted bcol files)
sv-manta-rdsbwa-sample.tsv.zst
structural variant calls by manta
cgsv-sample.tsv.zst
Complete Genomics structural variants
cgcnv-sample.tsv.zst
Complete Genomics CNV data

The result files from samtools variant calling on the same bam file (map-rdsbwa-sample1.bam) are named var-sam-rdsbwa-sample1.tsv.zst, sreg-sam-rdsbwa-sample1.tsv.zst, varall-sam-rdsbwa-sample1.tsv.zst, and reg_cluster-sam-rdsbwa-S0489.tsv.zst.

For Complete Genomics alignment and variant calling the files are named var-cg-cg-sample1.tsv.zst, sreg-cg-cg-sample1.tsv.zst, and reg_cluster-cg-cg-S0489.tsv.zst.
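The file naming convention for sample results can be illustrated by splitting a name into its elements. This Python sketch is not a genomecomb function: parse_result_name is a hypothetical helper, and it assumes that the extension starts at the first dot after the last hyphen-minus (so sample names containing a dot would be split incorrectly).

```python
def parse_result_name(filename):
    """Split a result file name into type, intermediate steps, sample, extension."""
    # The sample name is the last hyphen-separated element before the extension
    stem, _, last = filename.rpartition("-")
    if not stem:
        raise ValueError(f"{filename!r} does not follow the type-...-sample convention")
    # Assumption: the extension starts at the first dot of the last element
    sample, _, extension = last.partition(".")
    elements = stem.split("-")
    return {
        "type": elements[0],     # what is in the file, e.g. var, sreg, map
        "steps": elements[1:],   # e.g. variant caller and alignment
        "sample": sample,
        "extension": extension,
    }


print(parse_result_name("var-gatk-rdsbwa-sample1.tsv.zst"))
```

For var-gatk-rdsbwa-sample1.tsv.zst this gives the type var, the steps gatk and rdsbwa, and the sample name sample1.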

The sampledir may contain precalculated data from other pipelines. If these are in the correct format, they will be integrated in the project. vcf files (var-*.vcf) will be converted to tsv files, and their variants included in the multicompar.

compar dir

The subdirectory compar contains comparisons of all samples, e.g.:

annot_compar-projectname.tsv.zst
multicompar file containing information for all variants in all samples (and all methods). If a variant is not present in one of the samples, the information at the position of the variant will be completed (is the position sequenced or not, coverage, ...). The file is also annotated with all databases in refdir (impact on genes, regions of interest, known variant data).
sreg-projectname.tsv.zst
sequenced region multicompar file containing for all regions whether they are sequenced (1) or not (0) for each sample.
annot_sv-projectname.tsv.zst
multicompar structural variant file containing information for all structural variants in all samples (and all methods). This file is made differently from the small variants file: structural variant comparison uses approximate matching. Inversions and deletions are matched if they overlap at least 75%, and the begin and end positions differ less than 300 bases. For insertions and translocations, a difference of 30 bases in position is allowed (by default). Also, for structural variants information will not be completed (is the position sequenced or not, coverage, ...) for samples without a variant call. The file is also annotated with all databases in refdir (impact on genes, regions of interest, known variant data).
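The approximate matching of structural variants can be sketched as follows. This is an illustration of the rules as stated above, not genomecomb's actual code. Several points are assumptions: the function name sv_match, the type names del, inv, ins and trans, the representation of a variant as a (type, begin, end) tuple on one chromosome, and the choice to measure the 75% overlap against the longer of the two variants (the text does not say which length is meant).

```python
def sv_match(a, b, min_overlap=0.75, max_end_diff=300, max_pos_diff=30):
    """Decide whether two structural variants on the same chromosome match.

    a and b are (type, begin, end) tuples; type names are assumptions.
    """
    type_a, begin_a, end_a = a
    type_b, begin_b, end_b = b
    if type_a != type_b:
        return False
    if type_a in ("ins", "trans"):
        # Insertions/translocations: positions may differ by up to 30 bases
        return abs(begin_a - begin_b) <= max_pos_diff
    # Inversions/deletions: require sufficient overlap ...
    overlap = min(end_a, end_b) - max(begin_a, begin_b)
    longest = max(end_a - begin_a, end_b - begin_b)  # assumed denominator
    if longest <= 0 or overlap < min_overlap * longest:
        return False
    # ... and begin and end positions that differ less than 300 bases
    return (abs(begin_a - begin_b) < max_end_diff
            and abs(end_a - end_b) < max_end_diff)


print(sv_match(("del", 1000, 2000), ("del", 1100, 2050)))
print(sv_match(("del", 1000, 2000), ("del", 1400, 2400)))
```

The first pair overlaps by 900 of 1000 bases with endpoints 100 and 50 bases apart, so it matches; the second overlaps by only 600 bases and does not.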

analysisinfo files

Most files have an accompanying analysisinfo file (same name as the file, but with the extension .analysisinfo added). These are tsv files containing information about how the file was made (which programs were used, which versions, settings, ...)

projectinfo.tsv

projectinfo.tsv is a tsv file containing data about the project. It must have 2 columns: key and value. The following keys can be found:

refdir
directory containing reference data (genome sequence, annotation, ...).
split
if 1, each alternative allele is on a separate line. If 0, multiple alternative alleles in the same location and allele specific data are on one line, the relevant fields containing (comma separated) lists.
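Because projectinfo.tsv is a plain two-column tsv file, it can be read with standard tools. The Python sketch below is one possible way to do so, assuming a header line with the column names key and value; skipping lines that start with # is an additional assumption made here in case comment lines are present. The function name read_projectinfo and the example refdir path are hypothetical.

```python
import csv
import io


def read_projectinfo(lines):
    """Read key/value pairs from the lines of a projectinfo.tsv file."""
    # Assumption: lines starting with '#' are comments and can be skipped
    rows = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["key"]: row["value"] for row in reader}


# In-memory example; for a real file pass an open file object instead
example = io.StringIO("key\tvalue\nrefdir\t/path/to/refdir\nsplit\t1\n")
info = read_projectinfo(example)
print(info["refdir"], info["split"])
```

All values are returned as strings, so a check such as info["split"] == "1" is needed to test whether split variants were used.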