Annotate

Format

cg annotate ?options? variantfile resultfile dbfile ...

Summary

Annotate a variant file with region, gene or variant data

Description

Adds new columns with annotation to variantfile. Each dbfile will add 1 or more columns to the resultfile. different types of dbfiles are treated differently. The type is determined based on the first part of the filename (before the first underscore). Each column will start with a base name (the part of the filename after the last underscore)

Arguments

variantfile: file in tsv format with variant data
resultfile: resulting file in tsv format with new columns added
dbfile: files (tsv format) with features used for annotation. If a directory is given for dbfile, all known anotation files in this directory will be used for annotation

Options

-near dist: also annotate variants with the nearest feature in the dbfile if it is closer than dist to it. A column name_dist will be added that contains the distance. This option is only used for region annotation
-name namefield: The name added as annotation normally is taken from a field called name in the database file, or a field specified in the database opt file. Using -name you can explicitely choose the field to be used.
-u upstreamsize: The number of nts that will be considered up/downstream for gene annotation (default 2000)
-dbdir dbdir: can (optionally) be used to explicitely specify the directory containing the reference databases (The genome sequence genome_*.ifas in it is used for gene annotation). By default the directory the gene annotation file is in will be used. Annotation databases in dbdir will not automatically be used to annotate by using this option. (You will need to add dbdir as a dbfile parameter for that)
-replace y/n/e/a: what to do if annotation fields to be added are already in the variantfile: e (default) - give an error y - replace them with new annotation if dbfile is newer, n - keep the old annotation, a - allways replace them with new information
-distrreg: distribute regions for parallel processing. Possible options are: 0: no distribution (also empty) 1: default distribution schr or schromosome: each chromosome processed separately chr or chromosome: each chromosome processed separately, except the unsorted, etc. with a _ in the name that will be combined), a number: distribution into regions of this size a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches) a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions a number preceded by an g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes g: distribution into regions as given in the <refdir>/extra/reg_*_distrg.tsv file; if this does not exist uses g5000000 a file name: the regions in the file will be used for distribution
-margin number: (SV only) Allow begin and end to deviate the number of bases given (default 30)
-lmargin number: (SV only) Allow begin and end to deviate the number of bases given for deletions, inversions (default 300)
-tmargin number: (SV only) Allow begin and end to deviate the number of bases given for translocations (trans) and breakends (bnd) (default 300)
-overlap number: (SV only) minimum percent overlap needed to identify deletions, inversions or insertions (size) as the same (default 75)
-type sv/var: type determines which databases of a given annotation dir will be used for annotation (default var). Type var will not use sv_* annotations and type sv will not use the var_* annotations

database types

reg: regions file that must at least contain the columns chromosome,start,end. Variants are checked for overlap with regions in the file.
var: variations file that must at least contain the columns chromosome,start,end,type,ref,alt to annotate variants that match the given values. Typically, columns freq(p) and id are present for annotation. Thus, only variants that match the alleles given in alt will be annotated. If there are multiple alt alleles for the same genomic position, they should be all on the same line (unsplit) with the alt field containing a (comma separated) list of the different alt values. All information fields also contain a list with the values for the different alleles in the same order. In the typical variant file alternative alleles on the same position are split over different lines, you can use cg collapsealleles to convert. Var databases can also be a (multivalued) bcol formatted (bcol) file instead of a tsv; this is indicated by the extension bcol
gene: gene files (in gene tsv format). Variants will be annotated with the effects they have on the genes in these files as descibed below.
sv: Structural variant file that must at least contain the columns chromosome,start,end,type,ref,alt to annotate structural variants that approximately match the given values. Typically, columns freq(p) and id are present for annotation. SVs match if their respective begin positions (and end positions) differ < margin bases (lmargin for deletions and inversions, tmargin for translocations/breakends) and overlap at least overlap pct for deletions and inversions; For insertions the smaller must be at least overlap pct of the larger. breakends/translocations must link to the same chromosome at positions < tmargin bases apart.
mir: The effect of variants on miRNA genes is annotated based on a tsv file of the miRNA genes. (more detail below)
bcol: bcol databases are used to annotate positions (e.g. snps) with a given value. Database files are in the bcol format (also extension bcol).

If a database filename does not start with one of these types, it will be considered a regions database.

database parameters

If a file dbfile.opt exists, it will be scanned for database parameters. It should be a tab separated list, where each line contains a key and a value (separated by a tab)

Possible keys are:

name: this will be the base for names of added columns (in stead of extracting it from the filename)
fields: These fields will be extracted from the database and added to the annotated file in stead of the defaults (one or more of name, name2, freq and score, depending on the type and name of the database)

Gene annotation

Annotation with a gene database will add the three columns describing the effect of the variant on transcripts and resulting proteins.

dbname_impact: short code indicating impact/severity of the effect
dbname_gene: name of the gene(s) according to the database.
dbname_descr: location and extensive description of the effect(s) of the variant on each transcript

Each of the columns can contain a semicolon separated list to indicate different effects on different transcripts. If all values in such a list would be the same (e.g. gene name in case of multiple transcripts of the same gene), only this one value is shown (not a list).

Possible impact codes are:

downstream: downstream of gene (up to 2000 bases)
upstream: upstream of gene (up to 2000 bases)
intron: intronic
reg: regulatory
prom: promotor
splice: variant in splice region (3 up to 8 bases into the intron from the splice site)
RNA: in a transcript that is not coding
RNASPLICE: deletion containing at least one splice site (non-coding transcript)
UTR3: variant in the 3' UTR
UTR3SPLICE: deletion or complext variant containing at least one splice site in the 3' UTR
RNAEND: deletion containing the end of transcription
ESPLICE: essential splice site (2 bases into the intron from the splice site)
CDSsilent: variant in coding region that has no effect on the protein sequence
UTR5: variant in the 5' UTR
UTR5SPLICE: deletion or complext variant containing at least one splice site in the 5' UTR
UTR5KOZAK: variant in the 5' UTR close (6 nts) to the start codon.
RNASTART: transcription_start
CDSMIS: coding variant causing a change in the protein sequence
CDSDEL: deletion in the coding region (not affecting frame of translation)
CDSCOMP: complex variation (sub, inv, ...) in the coding region (not affecting frame of translation)
CDSINS: insertion in the coding region (not affecting frame of translation)
CDSNONSENSE: variation causing a premature stop codon in the protein sequence (nonsense)
CDSSPLICE: deletion or complext variant affecting a splice site in the coding region
CDSSTOP: change of a stop codon to a normal codon causing readthrough
CDSFRAME: indel causing a frameshift
CDSSTART: variation in the startcodon
CDSSTARTDEL: deletion affecting the startcodon
CDSSTARTCOMP: complex variation affecting the startcodon
GENEDEL: deletion (also used for sub) of whole gene
GENECOMP: complex variation (sub, inv, ...) affecting the whole gene

dbname_descr contains a description of the variant at multiple levels according to the HGVS variant nomenclature (v 15.11 http://varnomen.hgvs.org/recommendations, http://www.ncbi.nlm.nih.gov/pubmed/26931183, http://onlinelibrary.wiley.com/doi/10.1002/humu.22981/pdf). There are some (useful or necessary) deviations from from the recommendations:

The reference is allways the transcript name given in the database used; if a variant affects multiple transcripts, a separate description is given for each variant (with its own reference).
The reference is prefixed with the strand
An extra description is added (indicating the affected element e.g. which exon)
Multiple consequitive variants are not combined, e.g two consequitive substitutions (as genomecomb will usually create) are described as two separate snps instead of as the recommended delins. (One variant is adapted to shift to 3' or inserts changed to dup or rep as recommended.)
For brevity, protein changes are not parenthesised (even though they are all predictions)
For brevity, single letter AA codes are used.

The description consists of the following elements, separated by colons

transcript: name or id of the affected transcript, prefixed with a + if the transcript is in the forward strand, - for reverse strand
element and element position: element indicates the gene element the variant is located in (e.g. exon1 for the first exon). The element is followed by the relative position of the variant in the given element, separated by either a + or a -. For deletions spanning several elements, element and element position for both start and end point of the deletion are given, separated by _ - is used for the upstream element, giving the position in the upstream region relative to the start of transcription (-1 being the position just before the transcript start). + is used for all other elements, the position given is relative to the start of the element. The first base of exon2 would be given as exon2+1. (These positions are not shifted to 3' as in hgvs coding.)
DNA based description: description of the variant effect on de DNA level uses the coding (c.) or non-coding (n.) DNA reference. The genomic reference (g.) is not given as it can be easily deduced from the variant fields. This is only present if the transcript is affected (so not for up/downstream)
protein based description: description of the variant effect at the protein level (p.). This is only present if the protein is affected.

miRNA annotation

A miRNA gene file is a tsv file containing the following fields:

chromosome,begin,end,strand: indicate the location of the hairpin in the genome
name,transcript: name of the miRNA gene and transcript; Different isomiRs can be expressed from the same hairpin. These can be represented on different lines in the files as different "transcripts". It generally not necesary to give different transcript names, as the location of isomiR affected is geven in the annotation.
loopstart,loopend: begin and end genomic coordinates of the loop of the hairpin
mature1start,mature1end: genomic coordinates of the mature miRNA before the loop (vs the genomic reference). These fields can be left empty if no mature miRNA derives from that arm.
mature2start,mature2end: genomic coordinates of the mature miRNA(s) after the loop (can be empty as well).
status: (optional) field indicating the status of the miRNA gene

Based on this miRNA gene file, the genomecomb miRNA annotation adds the following fields:

dbname_impact: indicates which transcript is affected and the functional element of the miRNA the variant is in, followed by the location of the variant in this element between braces. e.g. a variant in the mature sequence, especially the seed, is more likely to have an impact on the function than one in the flank
dname_mir: name of the miRNA
dbname_status: optional field given when a status field is present in the annotation file If multiple miRNA genes are affected, the fields will contain a semi-colon separated list of impacts and genes.

Potential annotation elements are

mature5p: variant in the sequence that ends up in the mature miRNA of the 5p arm. The location of the affected isomiR in the hairpin is added after the mature5p, and if the variant is in the seed region (most important region in targetting) the word "seed" is added after the location, e.g. mature5p21_43(a+4)seed
mature3p: variant in the mature miRNA of the 3p arm. Same additions as for the mature5p are present.
loop: variant in the loop of the hairpin. Variants outside of the mature miRNA can affect the expression of the miRNA through changes in secundary structure and biogenesis.
armp5: 5' arm variant.
arm3p: 3' arm variant
flank: Variants up to 100 nts from the hairpin. These are still likely to affect the biogenesis of the miRNA.
upstream: more than 100 nts (and less than 2000 by default) before from the hairpin. If a variant affects a miRNA gene directly (up to flank), up/downstream annotations for other miRNA genes are not given.
downstream: more than 100 nts (and less than 2000 by default) after from the hairpin

The location in the element is given by a reference (a for arm, m mature, l for loop) and a number indicating how many nts the variant is located away from the reference, e.g. loop(a+2) indicates that the variant is in the loop, 2 nts away from the arm. Negative numbers are used to indicate counting from the opposite direction, e.g. arm5p(m-5) is used to indicate a variant in the 5' arm 5 nts back from the mature sequence. An e can be added to the number to indicate that the location is at either end of the given element. A deletion of the complete miRNA genes is indicated by the impact GENEDEL. Deletions spanning several (but not all) elements list the affected elements joined by &.

Home

Contact

Installation

Documentation