GenomeComb
The standard file format used in GenomeComb is the widely supported, simple, yet flexible tab-separated values file (tsv). A tsv file is a simple text file containing tabular data, where each line represents a record or row in the table. Each field value of a record is separated from the next by a tab character. As values cannot be quoted (as in csv), they cannot contain tabs or newlines, unless by coding them e.g. by using escape characters (\t,\n)
The first line of the tab file (not starting with a #) is a header that contains column names (or fields). Genomecomb allows comment lines (indicated by starting the line with a # character) containing metadata to precede the header. The convention genomecomb uses for storing metadata in the comments is described at the bottom of this help
Depending on the columns present, tsv files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. In this context, Genomecomb recognizes columns with specific field names to have a special meaning. The order/position of the columns does not matter, although genomecomb will usually write tsv files with columns in a specific order. All tsv files can be easily queried using the cg select functionality or loaded into a local database.
Region files are used to indicate regions on the genome, and potentially associate some name, score, annotation, ... to it. Region files contain at least the following fields:
Coordinates are in zero-based half-open format as used by UCSC bed files and Complete Genomics files. This means for instance that the first base of a sequence will be indicated by begin=0 and end=1. It is possible to indicate regions not containing a base, e.g. the position/region before the first base would be indicated by begin=0, end=0. chromosome is a just a string. Many genomecomb tools allow mixing "chr1" and "1" type of notation for chromosome, meaning that "chr1 1 2" is considered the same region as "1 1 2".
Most genomecomb tools expect region files to be sorted on chromosome, begin, end using a natural sort for chromosome names (.i.e sorted alphabetically, but with embedded numbers interpreted as numbers, e.g. chr1,chr2,chr11). Most of the files created by genomecomb are already properly sorted; You can sort files using the -s option of cg select. Use -s - to sort using the default expected sort order: cg select -s - file sortedfile
If the normal chromosome, begin, end fields are not present, the following alternative fieldnames are also recognized:
A variant file is a tsv file containing a list of variants. chromosome, begin and end fields indicate the location of the variant in the same way as in region files, while other fields define the variant at the location: The following basic fields are present in a variant file:
Variant files should be sorted on the fields chromosome, begin, end, type, alt (in that order).
Further fields can be present to describe the variant in the sample or annotation information, many depending on the variant caller used (e.g. gatk_vars and sam_vars). The following fields have specific meanings in genomecomb:
genomecomb has 2 options to deal with the presence of multiple alternative alleles on the same position:
The sequenced field indicates sequenig status of the variant in the sample. The following codes can be found:
When calling variants using GATK or samtools, genomecomb picks a relatively low quality treshhold (coverage < 5 or quality < 30) for considering variants unsequenced (sensitiviy over specificity). You can allways apply more stringent quality filtering on the result using cg select.
Possible zyg codes are:
Structural variants use basically the same format as small variants, but are usually in separate files because they are compared and queried differently. The only difference is in the extra types (inv, trans, bnd) and different alt notations for these types. For an inversion (inv) the alternative genotype (alt) is indicated by an "i". Translocations (trans) and breakends (bnd) are indicated by a position on the breakpoint ( zero length location like for an insertion). The alt is indicated by the following patterns, where chr:pos gives the location the variant location is linked to. (AS for insertions, pos here points to a location between bases) s can be pesent to indicate inserted bases in the breakpoint:
In a multicompar file, data for different samples is present in one file, so they can be compared. Fields that are specific to a sample have the samplename added to the fieldname separated by a dash, e.g. the zygosity of a variant in sample1 can be found in the column named zyg-sample1. The same sample can be analysed using different methods (aligners, variant callers). This is indicated using fieldname-methods-sample. (The methods-sample part is called the anlysis.)
A small example multicompar variant file with two samples would contain the following fields
Some formats used in bioinformatics contain data in a tab separated format where the header does not conform to the tsv specs. Most Genomecomb commands will detect and support some of these alternative comments/header styles:
The following conventions are used by genomecomb for storing metadata in the comments of a tsv file: The comment character (#) is followed by a keyword and a value separated by a tab character, e.g. the first comment lines will usually contain the "filetype" keyword and value, followed by fileversion and split status:
#filetype tsv/varfile #fileversion 0.99 #split 1
The same keyword (with different valuesd) can be repeated. In this case the values can be interpreted as a list. If the first value of such a list is "table" the list contains a tab-separated table: The next element (after table) contains the header, the following tab-separated lines with the data, e.g A typical table is the fields table describing the fields present in the tsv file, e.g.:
#fields table #fields field number type description #fields chromosome 1 String Chromosome/Contig #fields begin 1 Integer Begin of feature (0 based - half open) #fields end 1 Integer End of feature (0 based - half open) #fields type 1 String Type of feature (snp,del,ins,...) #fields ref 1 String Reference sequence, can be a number for large features #fields alt 1 String Alternative sequence, can be a number for large features #fields name 1 String name of feature #fields quality 1 Float Quality score of feature #fields filter 1 String Filter value #fields alleleSeq1 1 String allele present on first chromosome/haplotype #fields alleleSeq2 1 String allele present on second chromosome/haplotype #fields sequenced 1 String sequenced status: v = variant, r = reference (i.e. not this variant), u = unsequenced #fields zyg 1 String Zygosity status: m = homozygous, t = heterozygous, r = reference, o = other variant, v = variant but genotype unspecified, c = compound (i.e. genotype has this variant and other variant), u = unsequenced geno #fields phased 1 Integer Phased status: 0 if not phased, other integer if phased #fields genotypes H Integer Genotypes #fields alleledepth_ref 1 Integer Allelic depths for the ref allele #fields alleledepth A Integer Allelic depths for the alt alleles in the order listed #fields frequency A Float Allele Frequency
The fields table contains 4 columns: field contains the field name, number is 1 if the column contains a single value, A if it contains lists with a value for alternative allele or H if it contains a value for each haplotype/chromosome. type indicates the type of value in the column and description gives a textual description of the field.