GenomeComb

Annotate

Format

cg annotate ?options? variantfile resultfile dbfile ...

Summary

Annotate a variant file with region, gene or variant data

Description

Adds new columns with annotation to variantfile. Each dbfile will add 1 or more columns to the resultfile. Different types of dbfiles are treated differently. The type is determined from the first part of the filename (before the first underscore). The name of each added column starts with a base name (the part of the filename after the last underscore).
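
As an illustration of the naming rule above, here is a minimal Python sketch (not genomecomb code) that derives the database type and column base name from a filename. The function name classify_dbfile, the example filenames, and the choice to drop everything from the first dot onward before splitting are assumptions made for this sketch. The list of known types and the fallback to a regions database follow the "database types" section of this page.

```python
import os

# Database types listed on this page; other prefixes are treated as regions databases.
KNOWN_TYPES = {"reg", "var", "gene", "sv", "mir", "bcol"}

def classify_dbfile(path):
    """Return (dbtype, basename) for an annotation database filename (illustrative only)."""
    filename = os.path.basename(path)
    # Assumption: the extension (everything from the first dot) is not part of the name.
    stem = filename.split(".", 1)[0]
    prefix = stem.split("_", 1)[0]        # part before the first underscore
    basename = stem.rsplit("_", 1)[-1]    # part after the last underscore
    dbtype = prefix if prefix in KNOWN_TYPES else "reg"
    return dbtype, basename

# Hypothetical filenames, used only to show the rule
print(classify_dbfile("/refdb/hg19/var_hg19_snp135.tsv"))
print(classify_dbfile("mytargets.tsv"))
```

A name key in a dbfile.opt file would override the base name; the sketch does not model that.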

Arguments

variantfile
file in tsv format with variant data
resultfile
resulting file in tsv format with new columns added
dbfile
files (tsv format) with features used for annotation. If a directory is given for dbfile, all known annotation files in this directory will be used for annotation

Options

-near dist
also annotate variants with the nearest feature in the dbfile if it is closer than dist to it. A column name_dist will be added that contains the distance. This option is only used for region annotation
-name namefield
The name added as annotation is normally taken from a field called name in the database file, or a field specified in the database opt file. Using -name you can explicitly choose the field to be used.
-u upstreamsize
The number of nts that will be considered up/downstream for gene annotation (default 2000)
-dbdir dbdir
can (optionally) be used to explicitly specify the directory containing the reference databases (the genome sequence genome_*.ifas in it is used for gene annotation). By default, the directory containing the gene annotation file is used. This option does not cause the annotation databases in dbdir to be used for annotation automatically; for that, you need to add dbdir as a dbfile parameter as well.
-replace y/n/e/a
what to do if annotation fields to be added are already in the variantfile: e (default): give an error; y: replace them with new annotation if dbfile is newer; n: keep the old annotation; a: always replace them with new information
-distrreg
distribute regions for parallel processing. Possible options are:
0: no distribution (also empty)
1: default distribution
schr or schromosome: each chromosome processed separately
chr or chromosome: each chromosome processed separately, except the unsorted, etc. (with a _ in the name), which will be combined
a number: distribution into regions of this size
a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches)
a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions
a number preceded by a g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes
a file name: the regions in the file will be used for distribution
-margin number
(SV only) Allow begin and end to deviate the number of bases given (default 30)
-lmargin number
(SV only) Allow begin and end to deviate the number of bases given for deletions, inversions (default 300)
-tmargin number
(SV only) Allow begin and end to deviate the number of bases given for translocations (trans) and breakends (bnd) (default 300)
-overlap number
(SV only) minimum percent overlap needed to identify deletions, inversions or insertions (size) as the same (default 75)
-type sv/var
type determines which databases of a given annotation dir will be used for annotation (default var). Type var will not use sv_* annotations and type sv will not use the var_* annotations
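
The calling convention in the Format section (options first, then variantfile, resultfile and one or more dbfiles) can be sketched from a script as follows. This is only a sketch: build_annotate_command is a hypothetical helper and the file names are placeholders. It uses only options documented above and does not run cg itself.

```python
def build_annotate_command(variantfile, resultfile, dbfiles, options=None):
    """Assemble the argument list for a cg annotate call (hypothetical helper)."""
    command = ["cg", "annotate"]
    # Options come before the positional arguments, as in the Format line.
    for flag, value in (options or {}).items():
        command += [flag, str(value)]
    command += [variantfile, resultfile, *dbfiles]
    return command

# Placeholder file names; -near and -replace are options described on this page.
command = build_annotate_command(
    "variants.tsv",
    "annotated.tsv",
    ["reg_hg19_example.tsv"],
    options={"-near": 1000, "-replace": "y"},
)
print(" ".join(command))
```

The resulting list could be passed to subprocess.run(command, check=True) on a system where cg is installed.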

database types

reg
regions file that must at least contain the columns chromosome,start,end. Variants are checked for overlap with regions in the file.
var
variations file that must at least contain the columns chromosome,start,end,type,ref,alt to annotate variants that match the given values. Only variants that match the alleles given in alt will be annotated. Typically, columns freq(p) and id are present for annotation. If there are multiple alt alleles for the same genomic position, they should all be on the same line (unsplit), with the alt field containing a comma separated list of the different alt values. All information fields then also contain a list with the values for the different alleles, in the same order. In the typical variant file, alternative alleles at the same position are split over different lines; you can use cg collapsealleles to convert. A var database can also be a (multivalued) bcol formatted file instead of a tsv; this is indicated by the extension bcol
gene
gene files (in gene tsv format). Variants will be annotated with the effects they have on the genes in these files as described below.
sv
Structural variant file that must at least contain the columns chromosome,start,end,type,ref,alt to annotate structural variants that approximately match the given values. Typically, columns freq(p) and id are present for annotation. SVs match if their respective begin positions (and end positions) differ by less than margin bases (lmargin for deletions and inversions, tmargin for translocations/breakends). Deletions and inversions must also overlap by at least overlap percent; for insertions, the size of the smaller must be at least overlap percent of the size of the larger. Breakends/translocations must link to the same chromosome at positions less than tmargin bases apart.
mir
The effect of variants on miRNA genes is annotated based on a tsv file of the miRNA genes. (more detail below)
bcol
bcol databases are used to annotate positions (e.g. snps) with a given value. Database files are in the bcol format (also extension bcol).

If a database filename does not start with one of these types, it will be considered a regions database.
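
The approximate matching described for sv databases can be sketched in Python as below. This is an illustration under stated assumptions, not the genomecomb implementation: the dict keys (begin, end, size), the type codes del, inv and ins, and the choice to measure overlap relative to the longer of the two SVs are all assumptions. Translocations and breakends are left out.

```python
def sv_matches(a, b, margin=30, lmargin=300, overlap=75):
    """Rough check whether two SVs would count as the same (illustrative sketch)."""
    if a["chromosome"] != b["chromosome"] or a["type"] != b["type"]:
        return False
    svtype = a["type"]
    if svtype not in ("del", "inv", "ins"):
        raise ValueError("only del, inv and ins are covered by this sketch")
    # Deletions and inversions use the larger lmargin; insertions use margin.
    limit = lmargin if svtype in ("del", "inv") else margin
    if abs(a["begin"] - b["begin"]) >= limit or abs(a["end"] - b["end"]) >= limit:
        return False
    if svtype == "ins":
        # The smaller insertion must be at least overlap percent of the larger.
        small, large = sorted((a["size"], b["size"]))
        return large > 0 and 100.0 * small / large >= overlap
    # Assumption: percent overlap is the shared length relative to the longer SV.
    shared = min(a["end"], b["end"]) - max(a["begin"], b["begin"])
    longest = max(a["end"] - a["begin"], b["end"] - b["begin"])
    return longest > 0 and 100.0 * shared / longest >= overlap

del1 = {"chromosome": "chr1", "begin": 1000, "end": 2000, "type": "del"}
del2 = {"chromosome": "chr1", "begin": 1100, "end": 2050, "type": "del"}
print(sv_matches(del1, del2))
```

The defaults mirror the documented defaults of -margin, -lmargin and -overlap.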

database parameters

If a file dbfile.opt exists, it will be scanned for database parameters. It should be a tab separated list, where each line contains a key and a value (separated by a tab)

Possible keys are:

name
this will be the base for names of added columns (instead of extracting it from the filename)
fields
These fields will be extracted from the database and added to the annotated file instead of the defaults (one or more of name, name2, freq and score, depending on the type and name of the database)
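
For scripts that need to inspect such a parameter file, a small reader could look like the following sketch. The helper name read_db_options, the example file name, and the example values (including the space separated form of the fields value) are assumptions; the page only specifies one tab separated key and value per line.

```python
import os
import tempfile

def read_db_options(optfile):
    """Read a dbfile.opt style file into a dict of key -> value strings (sketch)."""
    options = {}
    with open(optfile) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line holds a key and a value separated by a tab.
            key, _, value = line.partition("\t")
            options[key] = value
    return options

# Demonstration with a temporary, made-up opt file
with tempfile.TemporaryDirectory() as tmpdir:
    optfile = os.path.join(tmpdir, "reg_hg19_example.tsv.opt")
    with open(optfile, "w") as handle:
        handle.write("name\texample\nfields\tname score\n")
    db_options = read_db_options(optfile)
print(db_options)
```

Values are kept as raw strings, since the exact format of the fields value is not specified here.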

Gene annotation

Annotation with a gene database adds three columns describing the effect of the variant on transcripts and the resulting proteins.

dbname_impact
short code indicating impact/severity of the effect
dbname_gene
name of the gene(s) according to the database.
dbname_descr
location and extensive description of the effect(s) of the variant on each transcript

Each of the columns can contain a semicolon separated list to indicate different effects on different transcripts. If all values in such a list would be the same (e.g. gene name in case of multiple transcripts of the same gene), only this one value is shown (not a list).
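
The collapsing rule for these lists can be illustrated with a short Python sketch. The function name and the example values are hypothetical; only the semicolon separator and the "show a single value when all are equal" rule come from the text above.

```python
def collapse_values(values):
    """Join per-transcript values with semicolons, or return the one shared value."""
    values = list(values)
    if values and all(value == values[0] for value in values):
        return values[0]
    return ";".join(values)

# Hypothetical per-transcript values for one variant
print(collapse_values(["GENEA", "GENEA"]))    # same gene for both transcripts
print(collapse_values(["CDSMIS", "intron"]))  # different impact per transcript
```

When reading an annotated file, a value without a semicolon can therefore apply to one transcript or to several transcripts that share it.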

Possible impact codes are:

downstream
downstream of gene (up to 2000 bases)
upstream
upstream of gene (up to 2000 bases)
intron
intronic
reg
regulatory
prom
promoter
splice
variant in splice region (3 up to 8 bases into the intron from the splice site)
RNA
in a transcript that is not coding
RNASPLICE
deletion containing at least one splice site (non-coding transcript)
UTR3
variant in the 3' UTR
UTR3SPLICE
deletion or complex variant containing at least one splice site in the 3' UTR
RNAEND
deletion containing the end of transcription
ESPLICE
essential splice site (2 bases into the intron from the splice site)
CDSsilent
variant in coding region that has no effect on the protein sequence
UTR5
variant in the 5' UTR
UTR5SPLICE
deletion or complex variant containing at least one splice site in the 5' UTR
UTR5KOZAK
variant in the 5' UTR close (6 nts) to the start codon.
RNASTART
transcription_start
CDSMIS
coding variant causing a change in the protein sequence
CDSDEL
deletion in the coding region (not affecting frame of translation)
CDSCOMP
complex variation (sub, inv, ...) in the coding region (not affecting frame of translation)
CDSINS
insertion in the coding region (not affecting frame of translation)
CDSNONSENSE
variation causing a premature stop codon in the protein sequence (nonsense)
CDSSPLICE
deletion or complex variant affecting a splice site in the coding region
CDSSTOP
change of a stop codon to a normal codon causing readthrough
CDSFRAME
indel causing a frameshift
CDSSTART
variation in the start codon
CDSSTARTDEL
deletion affecting the start codon
CDSSTARTCOMP
complex variation affecting the start codon
GENEDEL
deletion (also used for sub) of whole gene
GENECOMP
complex variation (sub, inv, ...) affecting the whole gene

dbname_descr contains a description of the variant at multiple levels according to the HGVS variant nomenclature (v 15.11 http://varnomen.hgvs.org/recommendations, http://www.ncbi.nlm.nih.gov/pubmed/26931183, http://onlinelibrary.wiley.com/doi/10.1002/humu.22981/pdf). There are some (useful or necessary) deviations from the recommendations.

The description consists of the following elements, separated by colons:

transcript
name or id of the affected transcript, prefixed with a + if the transcript is in the forward strand, - for reverse strand
element and element position
element indicates the gene element the variant is located in (e.g. exon1 for the first exon). The element is followed by the relative position of the variant in the given element, separated by either a + or a -. For deletions spanning several elements, element and element position for both start and end point of the deletion are given, separated by _. A - is used for the upstream element, giving the position in the upstream region relative to the start of transcription (-1 being the position just before the transcript start). A + is used for all other elements; the position given is relative to the start of the element. The first base of exon2 would be given as exon2+1. (These positions are not shifted to 3' as in hgvs coding.)
DNA based description
description of the variant effect at the DNA level, using the coding (c.) or non-coding (n.) DNA reference. The genomic reference (g.) is not given, as it can be easily deduced from the variant fields. This is only present if the transcript is affected (so not for up/downstream)
protein based description
description of the variant effect at the protein level (p.). This is only present if the protein is affected.
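
To show how the colon separated elements above fit together, here is a hedged Python sketch that splits one description into its parts. The example strings, the transcript names tx1 and tx2, and the function name parse_descr are invented for illustration; real output may differ in detail. The sketch assumes element names contain no underscore and that the DNA and protein parts can be recognised by their c., n. and p. prefixes.

```python
import re

# element name, then + or -, then the relative position (e.g. exon2+15, upstream-120)
ELEMENT_POSITION = re.compile(r"^(?P<element>.+?)(?P<sign>[+-])(?P<position>\d+)$")

def parse_descr(descr):
    """Split one transcript description into its documented parts (sketch)."""
    fields = descr.split(":")
    transcript = fields[0]
    strand = transcript[0] if transcript[:1] in ("+", "-") else None
    if strand is not None:
        transcript = transcript[1:]
    location = []
    if len(fields) > 1:
        # Deletions spanning several elements give start and end separated by _
        for part in fields[1].split("_"):
            match = ELEMENT_POSITION.match(part)
            if match:
                location.append((match["element"], match["sign"], int(match["position"])))
            else:
                location.append((part, None, None))
    rest = fields[2:]
    dna = next((f for f in rest if f.startswith(("c.", "n."))), None)
    protein = next((f for f in rest if f.startswith("p.")), None)
    return {"strand": strand, "transcript": transcript,
            "location": location, "dna": dna, "protein": protein}

# Made-up examples
print(parse_descr("+tx1:exon2+15:c.115A>G:p.Lys39Glu"))
print(parse_descr("-tx2:upstream-120"))
```

A column value with several transcripts would first be split on semicolons, then each part passed to this function.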

miRNA annotation

A miRNA gene file is a tsv file containing the following fields:

chromosome,begin,end,strand
indicate the location of the hairpin in the genome
name,transcript
name of the miRNA gene and transcript. Different isomiRs can be expressed from the same hairpin. These can be represented on different lines in the file as different "transcripts". It is generally not necessary to give different transcript names, as the location of the affected isomiR is given in the annotation.
loopstart,loopend
begin and end genomic coordinates of the loop of the hairpin
mature1start,mature1end
genomic coordinates of the mature miRNA before the loop (vs the genomic reference). These fields can be left empty if no mature miRNA derives from that arm.
mature2start,mature2end
genomic coordinates of the mature miRNA(s) after the loop (can be empty as well).
status
(optional) field indicating the status of the miRNA gene

Based on this miRNA gene file, the genomecomb miRNA annotation adds the following fields:

dbname_impact
indicates which transcript is affected and the functional element of the miRNA the variant is in, followed by the location of the variant in this element between parentheses. For example, a variant in the mature sequence, especially the seed, is more likely to have an impact on the function than one in the flank
dbname_mir
name of the miRNA
dbname_status
optional field given when a status field is present in the annotation file

If multiple miRNA genes are affected, the fields will contain a semicolon separated list of impacts and genes.

Potential annotation elements are

mature5p
variant in the sequence that ends up in the mature miRNA of the 5p arm. The location of the affected isomiR in the hairpin is added after the mature5p, and if the variant is in the seed region (most important region in targeting) the word "seed" is added after the location, e.g. mature5p21_43(a+4)seed
mature3p
variant in the mature miRNA of the 3p arm. Same additions as for the mature5p are present.
loop
variant in the loop of the hairpin. Variants outside of the mature miRNA can affect the expression of the miRNA through changes in secondary structure and biogenesis.
arm5p
5' arm variant.
arm3p
3' arm variant
flank
Variants up to 100 nts from the hairpin. These are still likely to affect the biogenesis of the miRNA.
upstream
more than 100 nts (and less than 2000 by default) before the hairpin. If a variant affects a miRNA gene directly (up to flank), up/downstream annotations for other miRNA genes are not given.
downstream
more than 100 nts (and less than 2000 by default) after the hairpin

The location in the element is given by a reference (a for arm, m for mature, l for loop) and a number indicating how many nts the variant is located away from the reference, e.g. loop(a+2) indicates that the variant is in the loop, 2 nts away from the arm. Negative numbers are used to indicate counting from the opposite direction, e.g. arm5p(m-5) indicates a variant in the 5' arm, 5 nts back from the mature sequence. An e can be added to the number to indicate that the location is at either end of the given element. A deletion of the complete miRNA gene is indicated by the impact GENEDEL. Deletions spanning several (but not all) elements list the affected elements joined by &.
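
The element and location notation can be taken apart with a regular expression, as in the Python sketch below. It is based only on the three examples given on this page (loop(a+2), arm5p(m-5) and mature5p21_43(a+4)seed). The exact placement of the e marker directly after the number, and treating the parenthesised location as optional, are assumptions. Any transcript information in the impact value and elements joined by & are not handled.

```python
import re

LOCATION = re.compile(
    r"^(?P<element>mature[35]p|arm[35]p|loop|flank|upstream|downstream)"
    r"(?P<isomir>\d+_\d+)?"                      # isomiR location, e.g. 21_43
    r"(?:\((?P<reference>[aml])(?P<offset>[+-]\d+)(?P<at_end>e)?\))?"
    r"(?P<seed>seed)?$"
)

def parse_mir_location(text):
    """Parse one element annotation such as loop(a+2); returns None if unrecognised."""
    match = LOCATION.match(text)
    if match is None:
        return None
    info = match.groupdict()
    info["offset"] = int(info["offset"]) if info["offset"] is not None else None
    info["at_end"] = info["at_end"] is not None
    info["seed"] = info["seed"] is not None
    return info

for example in ("loop(a+2)", "arm5p(m-5)", "mature5p21_43(a+4)seed"):
    print(example, parse_mir_location(example))
```

A negative offset, as in arm5p(m-5), indicates counting back from the reference, as described above.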

Category

Annotation