GenomeComb

tsv format

The standard file format used in GenomeComb is the widely supported, simple, yet flexible tab-separated values file (tsv). A tsv file is a simple text file containing tabular data, where each line represents a record or row in the table. Each field value of a record is separated from the next by a tab character. As values cannot be quoted (as in csv), they cannot contain tabs or newlines, unless by coding them e.g. by using escape characters (\t,\n)

The first line of the tab file (not starting with a #) is a header that contains column names (or fields). Genomecomb allows comment lines (indicated by starting the line with a # character) containing metadata to precede the header. The convention genomecomb uses for storing metadata in the comments is described at the bottom of this help

Depending on the columns present, tsv files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. In this context, Genomecomb recognizes columns with specific field names to have a special meaning. The order/position of the columns does not matter, although genomecomb will usually write tsv files with columns in a specific order. All tsv files can be easily queried using the cg select functionality or loaded into a local database.

region file format

Region files are used to indicate regions on the genome, and potentially associate some name, score, annotation, ... to it. Region files contain at least the following fields:

chromosome
a string indicating the chromosome name.
begin
a number indicating the start of the region (half-open coordinates).
end
a number indicating the end of the region (half-open coordinates) Any extra columns can be added to provide information on the region.

Coordinates are in zero-based half-open format as used by UCSC bed files and Complete Genomics files. This means for instance that the first base of a sequence will be indicated by begin=0 and end=1. It is possible to indicate regions not containing a base, e.g. the position/region before the first base would be indicated by begin=0, end=0. chromosome is a just a string. Many genomecomb tools allow mixing "chr1" and "1" type of notation for chromosome, meaning that "chr1 1 2" is considered the same region as "1 1 2".

Most genomecomb tools expect region files to be sorted on chromosome, begin, end using a natural sort for chromosome names (.i.e sorted alphabetically, but with embedded numbers interpreted as numbers, e.g. chr1,chr2,chr11). Most of the files created by genomecomb are already properly sorted; You can sort files using the -s option of cg select. Use -s - to sort using the default expected sort order: cg select -s - file sortedfile

If the normal chromosome, begin, end fields are not present, the following alternative fieldnames are also recognized:

chromosome
"chrom", "chr", "chr1", "genoName", "tName" and "contig"
begin
"start", "end1", "chromStart", "genoStart", "tStart", txStart", "pos" and "offset" (end1 is recognised as begin because of the structural variant code in genomecomb, where start1,end1 and start2,end2 regions surround a SV).
end
"start2", "chromEnd", "genoEnd", "tEnd" or "txEnd"

variant file format

A variant file is a tsv file containing a list of variants. chromosome, begin and end fields indicate the location of the variant in the same way as in region files, while other fields define the variant at the location: The following basic fields are present in a variant file:

chromosome
chromosome name.
begin
start of the region (half-open coordinates).
end
end of the region (half-open coordinates)
type
type of variation: snp, ins, del, sub are recognised
ref
genotype of the reference sequence at the feature. For large deletions, the size of the deletion can be used. Insertions will have an empty string as ref. (also reference)
alt
alternative allele(s). If there are more than one alternatives, they are separated by commas. Deletions have an empty string as alt allele. (also alternative)

Variant files should be sorted on the fields chromosome, begin, end, type, alt (in that order).

Further fields can be present to describe the variant in the sample or annotation information, many depending on the variant caller used (e.g. gatk_vars and sam_vars). The following fields have specific meanings in genomecomb:

sequenced
single letter code describing sequencing status (described below)
zyg
zygosity, a single letter code indicating the zygosity of sample for the variant. (described below)
alleleSeq1
genotype of variant at one allele
alleleSeq2
genotype of variant at other allele
phased
order of genotypes in alleleSeq1 and alleleSeq2 is significant (phase is known)
quality
quality of the variant (or reference) call, assigned by the variant caller. Normally phred scaled: -10log_10 prob(call in ALT is wrong))
coverage
number of reads used to call the variant (covering the variant). This is as reported by the variant caller; Interpretation of coverage can differ between different callers

genomecomb has 2 options to deal with the presence of multiple alternative alleles on the same position:

split
Each alternative allele is on a seperate line. e.g. A to G,C variant (multialleic notation) is split into an A to G and an A to C variant. While cg select can select based on lists in fields (as in multiallelic mode), split mode makes querying and selection of variants much easier. Split mode is the default.
multiallelic
one line per position and type. All alternative alleles are in one line in the variant file. The alt field contains a list (separated by commas) of alternative alleles. If any of the other fields contains values specific to an allele (e.g. frequency of the allele in a population), this field will contain a comma separated list with values in the order as the alt alleles list

sequenced field

The sequenced field indicates sequenig status of the variant in the sample. The following codes can be found:

u
the position is considered unsequenced in the sample (e.g. because coverage or quality was too low).
v
the variant was found in the sample.
r
the position was sequenced, but the given variant is not present In multiallelic mode, r allways means that the genotype is reference. In split mode however, "v" will only be assigned if the specific alternative is present in the genotype. So "r" will be used even if there are non-reference alleles, as long as they are not the given alternative!

When calling variants using GATK or samtools, genomecomb picks a relatively low quality treshhold (coverage < 5 or quality < 30) for considering variants unsequenced (sensitiviy over specificity). You can allways apply more stringent quality filtering on the result using cg select.

zyg

Possible zyg codes are:

m
homozygous; the sample has two times the given alternative allele
t
heterozygous; the sample has the given alternative allele and one reference allele
c
compound; the sample has two different non-reference alleles. In split mode, c is only used if one the those is the given variant alt allele.
o
other; This is only used in split mode when the sample contains non-reference alleles other than the variant alt allele.
r
reference
v
variant, but genotype was not specified
u
unsequenced It is possible to have an assigned zyg other than u (e.g. t) even when the sequenced field is u, meaning that the variant caller could make a zygosity estimate/prediction, but the variant call is not of enough quality to consider it sequenced.

Structural variants

Structural variants use basically the same format as small variants, but are usually in separate files because they are compared and queried differently. The only difference is in the extra types (inv, trans, bnd) and different alt notations for these types. For an inversion (inv) the alternative genotype (alt) is indicated by an "i". Translocations (trans) and breakends (bnd) are indicated by a position on the breakpoint ( zero length location like for an insertion). The alt is indicated by the following patterns, where chr:pos gives the location the variant location is linked to. (AS for insertions, pos here points to a location between bases) s can be pesent to indicate inserted bases in the breakpoint:

multicompar file

In a multicompar file, data for different samples is present in one file, so they can be compared. Fields that are specific to a sample have the samplename added to the fieldname separated by a dash, e.g. the zygosity of a variant in sample1 can be found in the column named zyg-sample1. The same sample can be analysed using different methods (aligners, variant callers). This is indicated using fieldname-methods-sample. (The methods-sample part is called the anlysis.)

A small example multicompar variant file with two samples would contain the following fields

chromosome
chromosome name.
begin
start of the region (half-open coordinates).
end
end of the region (half-open coordinates)
type
type of variation: snp, ins, del, sub are recognised
ref
reference sequence
alt
alternative allele(s).
sequenced-gatkh-rdsbwa-sample1
sequencing status of sample1, based on the gatk haplotypecaller (gatkh) on a cleaned bwa alignment
zyg-gatkh-rdsbwa-sample1
zygosity of sample1
quality-gatkh-rdsbwa-sample1
variant quality in sample1
alleleSeq1-gatkh-rdsbwa-sample1
genotype of variant at one allele in sample1
alleleSeq2-gatkh-rdsbwa-sample1
genotype of variant at other allele in sample1
sequenced-gatkh-rdsbwa-sample2
sequencing status of sample2
zyg-gatkh-rdsbwa-sample2
zygosity of sample2
quality-gatkh-rdsbwa-sample2
variant quality in sample2
alleleSeq1-gatkh-rdsbwa-sample2
genotype of variant at one allele in sample2
alleleSeq2-gatkh-rdsbwa-sample2
genotype of variant at other allele in sample2

tab based bioinformatics formats

Some formats used in bioinformatics contain data in a tab separated format where the header does not conform to the tsv specs. Most Genomecomb commands will detect and support some of these alternative comments/header styles:

sam
starts with "@HD VN", header lines start with @, uses fixed columns
vcf
starts with "fileformat=VCF", the last "comment" line contains the header. In order to extract the data merged in some of the vcf fields into a genomecomb supported tsv, use cg vcf2tsv
Complete genomics
header line is preceeded by an empty line and starts with a > character

metadata in comments format

The following conventions are used by genomecomb for storing metadata in the comments of a tsv file: The comment character (#) is followed by a keyword and a value separated by a tab character, e.g. the first comment lines will usually contain the "filetype" keyword and value, followed by fileversion and split status:

#filetype   tsv/varfile
#fileversion    0.99
#split  1

The same keyword (with different valuesd) can be repeated. In this case the values can be interpreted as a list. If the first value of such a list is "table" the list contains a tab-separated table: The next element (after table) contains the header, the following tab-separated lines with the data, e.g A typical table is the fields table describing the fields present in the tsv file, e.g.:

#fields table
#fields field   number  type    description
#fields chromosome  1   String  Chromosome/Contig
#fields begin   1   Integer Begin of feature (0 based - half open)
#fields end 1   Integer End of feature (0 based - half open)
#fields type    1   String  Type of feature (snp,del,ins,...)
#fields ref 1   String  Reference sequence, can be a number for large features
#fields alt 1   String  Alternative sequence, can be a number for large features
#fields name    1   String  name of feature
#fields quality 1   Float   Quality score of feature
#fields filter  1   String  Filter value
#fields alleleSeq1  1   String  allele present on first chromosome/haplotype
#fields alleleSeq2  1   String  allele present on second chromosome/haplotype
#fields sequenced   1   String  sequenced status: v = variant, r = reference (i.e. not this variant), u = unsequenced
#fields zyg 1   String  Zygosity status: m = homozygous, t = heterozygous, r = reference, o = other variant, v = variant but genotype unspecified, c = compound (i.e. genotype has this variant and other variant), u = unsequenced geno
#fields phased  1   Integer Phased status: 0 if not phased, other integer if phased
#fields genotypes   H   Integer Genotypes
#fields alleledepth_ref 1   Integer Allelic depths for the ref allele
#fields alleledepth A   Integer Allelic depths for the alt alleles in the order listed
#fields frequency   A   Float   Allele Frequency

The fields table contains 4 columns: field contains the field name, number is 1 if the column contains a single value, A if it contains lists with a value for alternative allele or H if it contains a value for each haplotype/chromosome. type indicates the type of value in the column and description gives a textual description of the field.