Annotate
Format
cg annotate ?options? variantfile resultfile dbfile ...
Summary
Annotate a variant file with region, gene or variant data
Description
Adds new columns with annotation to variantfile. Each dbfile will
add 1 or more columns to the resultfile. Different types of dbfiles
are treated differently. The type is determined based on the first
part of the filename (before the first underscore). Each column name
will start with a base name (the part of the filename after the last
underscore).
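The naming rule above can be illustrated with a small Python sketch. It is not part of genomecomb; the function name, the example file names and the way extensions are stripped (everything from the first dot on) are assumptions made for illustration only.

```python
import os

# Type prefixes listed under "database types" in this help text.
KNOWN_TYPES = {"reg", "var", "gene", "sv", "mir", "bcol"}


def describe_dbfile(path):
    """Guess (dbtype, basename) from a database file name.

    Assumption: everything from the first dot on is an extension
    (e.g. ".tsv"); the real tool may strip extensions differently.
    """
    filename = os.path.basename(path)
    stem = filename.split(".", 1)[0]
    # Type: the part before the first underscore; unknown prefixes
    # are treated as a regions database.
    prefix = stem.split("_", 1)[0]
    dbtype = prefix if prefix in KNOWN_TYPES else "reg"
    # Base name: the part after the last underscore.
    basename = stem.rsplit("_", 1)[-1]
    return dbtype, basename


# Hypothetical file names, used only to show the rule.
print(describe_dbfile("/refdb/var_hg19_snp135.tsv"))
print(describe_dbfile("mytargets.tsv"))
```

With this rule, the columns added from the first file would start with snp135, and the second file would be handled as a regions database.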
Arguments
- variantfile
- file in tsv format with variant data
- resultfile
- resulting file in tsv format with new
columns added
- dbfile
- files (tsv format) with features used
for annotation. If a directory is given for dbfile, all known
annotation files in this directory will be used for annotation
Options
- -near dist
- also annotate variants with the nearest feature in the dbfile
if it is closer than dist to it. A column name_dist will be
added that contains the distance. This option is only used for
region annotation
- -name namefield
- The name added as annotation is normally taken from a field
called name in the database file, or from a field specified in the
database opt file. Using -name you can explicitly choose the field
to be used.
- -u upstreamsize
- The number of nts that will be considered up/downstream for
gene annotation (default 2000)
- -dbdir dbdir
- can (optionally) be used to explicitly specify the directory
containing the reference databases (the genome sequence
genome_*.ifas in it is used for gene annotation). By default the
directory the gene annotation file is in will be used. Using this
option does not automatically add the annotation databases in dbdir
to the annotation; for that, you need to add dbdir as a dbfile
parameter as well.
- -replace y/n/e/a
- what to do if annotation fields to be added are already in
the variantfile: e (default) - give an error, y - replace them with
new annotation if dbfile is newer, n - keep the old annotation, a -
always replace them with new information
- -distrreg
- distribute regions for parallel processing. Possible options
are:
  - 0: no distribution (also empty)
  - 1: default distribution
  - schr or schromosome: each chromosome processed separately
  - chr or chromosome: each chromosome processed separately, except
    the unsorted etc. sequences with a _ in the name, which are
    combined
  - a number: distribution into regions of this size
  - a number preceded by an s: distribution into regions targeting
    the given size, but breaks can only occur in unsequenced regions
    of the genome (N stretches)
  - a number preceded by an r: distribution into regions targeting
    the given size, but breaks can only occur in large (>=100000
    bases) repeat regions
  - a number preceded by a g: distribution into regions targeting
    the given size, but breaks can only occur in large (>=200000
    bases) regions without known genes
  - a file name: the regions in the file will be used for
    distribution
- -margin number
- (SV only) Allow begin and end to deviate by the number of bases
given (default 30)
- -lmargin number
- (SV only) Allow begin and end to deviate by the number of bases
given for deletions, inversions (default 300)
- -tmargin number
- (SV only) Allow begin and end to deviate by the number of bases
given for translocations (trans) and breakends (bnd) (default 300)
- -overlap number
- (SV only) minimum percent overlap needed to identify
deletions, inversions or insertions (size) as the same (default 75)
- -type sv/var
- type determines which databases of a given annotation dir
will be used for annotation (default var). Type var will not use
sv_* annotations and type sv will not use the var_* annotations
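How -margin, -lmargin and -overlap interact for SV annotation (see the sv database type) can be sketched in Python as follows. This is only an illustration of the rule as stated in this help text, not the genomecomb implementation. The dictionary layout, the type codes del, inv and ins, the size key for insertions, and measuring overlap relative to the longer of the two SVs are all assumptions. Translocations and breakends (-tmargin) are left out.

```python
def sv_same(a, b, margin=30, lmargin=300, overlap=75):
    """Return True if two SVs would count as the same (sketch).

    a and b are dicts with chromosome, begin, end and type; insertions
    also carry a size (hypothetical layout, for illustration only).
    """
    if a["type"] != b["type"] or a["chromosome"] != b["chromosome"]:
        return False
    svtype = a["type"]
    # Deletions and inversions use the larger lmargin.
    allowed = lmargin if svtype in ("del", "inv") else margin
    if abs(a["begin"] - b["begin"]) >= allowed:
        return False
    if abs(a["end"] - b["end"]) >= allowed:
        return False
    if svtype in ("del", "inv"):
        shared = min(a["end"], b["end"]) - max(a["begin"], b["begin"])
        # Assumption: percent overlap is taken relative to the longer SV.
        longest = max(a["end"] - a["begin"], b["end"] - b["begin"])
        return longest > 0 and 100.0 * shared / longest >= overlap
    if svtype == "ins":
        # The smaller insertion must be at least overlap percent of the larger.
        small, large = sorted((a["size"], b["size"]))
        return large > 0 and 100.0 * small / large >= overlap
    return True


query = {"chromosome": "chr1", "begin": 1000, "end": 2000, "type": "del"}
known = {"chromosome": "chr1", "begin": 1100, "end": 2100, "type": "del"}
print(sv_same(query, known))
```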
database types
- reg
- regions file that must at least contain the columns
chromosome,start,end. Variants are checked for overlap with regions
in the file.
- var
- variations file that must at least contain the columns
chromosome,start,end,type,ref,alt to annotate variants that match
the given values. Typically, columns freq(p) and id are present for
annotation. Thus, only variants that match the alleles given in alt
will be annotated. If there are multiple alt alleles for the same
genomic position, they should be all on the same line (unsplit)
with the alt field containing a (comma separated) list of the
different alt values. All information fields also contain a list
with the values for the different alleles in the same order. In a
typical variant file, alternative alleles at the same position are
split over different lines; you can use cg collapsealleles to convert
such a file.
Var databases can also be a (multivalued) file in bcol format instead of a tsv; this is indicated by the extension bcol.
- gene
- gene files (in gene tsv format).
Variants will be annotated with the effects they have on the genes
in these files as described below.
- sv
- Structural variant file that must at least contain the
columns chromosome,start,end,type,ref,alt to annotate structural
variants that approximately match the given values. Typically,
columns freq(p) and id are present for annotation. SVs match if
their respective begin positions (and end positions) differ by less
than margin bases (lmargin for deletions and inversions,
tmargin for translocations/breakends) and, for deletions and
inversions, they overlap by at least overlap percent. For insertions
the size of the smaller must be at least overlap percent of the
larger. Breakends/translocations must link to the same chromosome at
positions less than tmargin bases apart.
- mir
- The effect of variants on miRNA genes is annotated based on a
tsv file of the miRNA genes. (more detail below)
- bcol
- bcol databases are used to annotate positions (e.g. snps)
with a given value. Database files are in the bcol format (also extension bcol).
If a database filename does not start with one of these types,
it will be considered a regions database.
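For var databases, the unsplit layout described above (one line per position, with comma separated lists in alt and in the information fields) implies a lookup by allele index. A minimal Python sketch of that lookup follows. The function name and the example row are made up for illustration, and it assumes the information fields are named freq and id.

```python
def annotation_for_alt(db_row, variant_alt, fields=("freq", "id")):
    """Pick the annotation values belonging to one alt allele.

    db_row is a dict holding one (unsplit) var database line. Returns
    None when the allele is not listed, i.e. the variant would not be
    annotated.
    """
    alts = db_row["alt"].split(",")
    if variant_alt not in alts:
        return None
    index = alts.index(variant_alt)
    # Information fields list their values in the same order as alt.
    return {field: db_row[field].split(",")[index] for field in fields}


# Hypothetical database line with two alt alleles at one position.
row = {
    "chromosome": "chr1", "begin": "100", "end": "101", "type": "snp",
    "ref": "A", "alt": "C,T", "freq": "0.01,0.20", "id": "rs1,rs2",
}
print(annotation_for_alt(row, "T"))
print(annotation_for_alt(row, "G"))
```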
database parameters
If a file dbfile.opt exists, it will be scanned for database
parameters. It should be a tab separated list, where each line
contains a key and a value (separated by a tab)
Possible keys are:
- name
- this will be the base for names of added columns (instead of
extracting it from the filename)
- fields
- These fields will be extracted from the database and added to
the annotated file instead of the defaults (one or more of name,
name2, freq and score, depending on the type and name of the
database)
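Reading such an opt file is simple; the Python sketch below shows one way to do it. It is an illustration, not genomecomb code. The function name and the example content are assumptions, and the value of fields is kept as a plain string because this help text does not say how multiple fields are separated.

```python
def parse_db_opt(lines):
    """Parse the lines of a dbfile.opt file into a dict (sketch).

    Each non-empty line holds a key and a value separated by a tab.
    """
    params = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        params[key] = value
    return params


# Hypothetical opt file content; with a real file you would pass an
# open file object instead of a list of lines.
example = ["name\tmydb\n", "fields\tname score\n"]
print(parse_db_opt(example))
```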
Gene annotation
Annotation with a gene database will add three columns
describing the effect of the variant on transcripts and resulting
proteins.
- dbname_impact
- short code indicating impact/severity of the effect
- dbname_gene
- name of the gene(s) according to the database.
- dbname_descr
- location and extensive description of the effect(s) of the
variant on each transcript
Each of the columns can contain a semicolon separated list to
indicate different effects on different transcripts. If all values in
such a list would be the same (e.g. gene name in case of multiple
transcripts of the same gene), only this one value is shown (not a
list).
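The collapsing rule for these columns can be written down in a few lines of Python. This is a sketch of the rule as described here, with a made-up function name and made-up gene names, not the code genomecomb uses.

```python
def join_transcript_values(values):
    """Combine per-transcript values into one column value (sketch).

    If every transcript gives the same value, only that value is shown;
    otherwise the values are joined into a semicolon separated list.
    """
    values = list(values)
    if values and all(value == values[0] for value in values):
        return values[0]
    return ";".join(values)


# Two transcripts of the same (hypothetical) gene, with different impacts.
print(join_transcript_values(["GENE1", "GENE1"]))
print(join_transcript_values(["CDSMIS", "intron"]))
```

When reading an annotated file, this means a single value in one column may apply to all entries of a list in another column.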
Possible impact codes are:
- downstream
- downstream of gene (up to 2000 bases)
- upstream
- upstream of gene (up to 2000 bases)
- intron
- intronic
- reg
- regulatory
- prom
- promoter
- splice
- variant in splice region (3 up to 8 bases into the intron
from the splice site)
- RNA
- in a transcript that is not coding
- RNASPLICE
- deletion containing at least one splice site (non-coding
transcript)
- UTR3
- variant in the 3' UTR
- UTR3SPLICE
- deletion or complex variant containing at least one splice
site in the 3' UTR
- RNAEND
- deletion containing the end of transcription
- ESPLICE
- essential splice site (2 bases into the intron from the
splice site)
- CDSsilent
- variant in coding region that has no effect on the protein
sequence
- UTR5
- variant in the 5' UTR
- UTR5SPLICE
- deletion or complex variant containing at least one splice
site in the 5' UTR
- UTR5KOZAK
- variant in the 5' UTR close (6 nts) to the start codon.
- RNASTART
- transcription start
- CDSMIS
- coding variant causing a change in the protein sequence
- CDSDEL
- deletion in the coding region (not affecting frame of
translation)
- CDSCOMP
- complex variation (sub, inv, ...) in the coding region (not
affecting frame of translation)
- CDSINS
- insertion in the coding region (not affecting frame of
translation)
- CDSNONSENSE
- variation causing a premature stop codon in the protein
sequence (nonsense)
- CDSSPLICE
- deletion or complex variant affecting a splice site in the
coding region
- CDSSTOP
- change of a stop codon to a normal codon causing readthrough
- CDSFRAME
- indel causing a frameshift
- CDSSTART
- variation in the start codon
- CDSSTARTDEL
- deletion affecting the start codon
- CDSSTARTCOMP
- complex variation affecting the start codon
- GENEDEL
- deletion (also used for sub) of whole gene
- GENECOMP
- complex variation (sub, inv, ...) affecting the whole gene
dbname_descr contains a description of the variant at
multiple levels according to the HGVS variant nomenclature (v 15.11 http://varnomen.hgvs.org/recommendations, http://www.ncbi.nlm.nih.gov/pubmed/26931183, http://onlinelibrary.wiley.com/doi/10.1002/humu.22981/pdf).
There are some (useful or necessary) deviations from the
recommendations:
- The reference is always the transcript name given in the
database used; if a variant affects multiple transcripts, a
separate description is given for each transcript (with its own
reference).
- The reference is prefixed with the strand
- An extra description is added (indicating the affected
element e.g. which exon)
- Multiple consecutive variants are not combined, e.g. two
consecutive substitutions (as genomecomb will usually create) are
described as two separate snps instead of as the recommended
delins. (A single variant is still adapted as recommended: it is
shifted to 3', and insertions are changed to dup or rep.)
- For brevity, protein changes are not parenthesised (even
though they are all predictions)
- For brevity, single letter AA codes are used.
The description consists of the following elements, separated by
colons
- transcript
- name or id of the affected transcript, prefixed with a + if
the transcript is in the forward strand, - for reverse strand
- element and element position
- element indicates the gene element the variant is located in
(e.g. exon1 for the first exon). The element is followed by the
relative position of the variant in the given element, separated by
either a + or a -. For deletions spanning several elements, element
and element position for both start and end point of the deletion
are given, separated by _. A - is used for the upstream element,
giving the position in the upstream region relative to the start of
transcription (-1 being the position just before the transcript
start). A + is used for all other elements; the position given is
relative to the start of the element. The first base of exon2 would
be given as exon2+1. (These positions are not shifted to 3' as in
hgvs coding.)
- DNA based description
- description of the variant effect at the DNA level, using the
coding (c.) or non-coding (n.) DNA reference. The genomic reference
(g.) is not given as it can be easily deduced from the variant
fields. This is only present if the transcript is affected (so not
for up/downstream)
- protein based description
- description of the variant effect at the protein level (p.).
This is only present if the protein is affected.
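Splitting one description into the elements listed above could look like the following Python sketch. The function name and the example strings are made up for illustration; they are not taken from real cg annotate output. The sketch handles the description for a single transcript, so a semicolon separated list has to be split first.

```python
def parse_descr(descr):
    """Split one dbname_descr entry into its elements (sketch)."""
    parts = descr.split(":")
    transcript = parts[0]
    # The transcript is prefixed with + (forward) or - (reverse strand).
    strand = transcript[0] if transcript[:1] in ("+", "-") else None
    name = transcript[1:] if strand else transcript
    result = {
        "strand": strand,
        "transcript": name,
        "element": parts[1] if len(parts) > 1 else None,
        "dna": None,
        "protein": None,
    }
    # The DNA and protein descriptions are only present when affected.
    for part in parts[2:]:
        if part.startswith(("c.", "n.")):
            result["dna"] = part
        elif part.startswith("p."):
            result["protein"] = part
    return result


# Hypothetical descriptions, for illustration only.
print(parse_descr("+tx1:exon2+15:c.123A>G:p.K41R"))
print(parse_descr("-tx2:upstream-120"))
```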
miRNA annotation
A miRNA gene file is a tsv file containing the following fields:
- chromosome,begin,end,strand
- indicate the location of the hairpin in the genome
- name,transcript
- name of the miRNA gene and transcript; different isomiRs can
be expressed from the same hairpin. These can be represented on
different lines in the file as different "transcripts".
It is generally not necessary to give different transcript names, as
the location of the affected isomiR is given in the annotation.
- loopstart,loopend
- begin and end genomic coordinates of the loop of the hairpin
- mature1start,mature1end
- genomic coordinates of the mature miRNA before the loop (vs
the genomic reference). These fields can be left empty if no mature
miRNA derives from that arm.
- mature2start,mature2end
- genomic coordinates of the mature miRNA(s) after the loop
(can be empty as well).
- status
- (optional) field indicating the status of the miRNA gene
Based on this miRNA gene file, the genomecomb miRNA annotation
adds the following fields:
- dbname_impact
- indicates which transcript is affected and the functional
element of the miRNA the variant is in, followed by the location of
the variant in this element in parentheses. The element matters:
e.g. a variant in the mature sequence, especially the seed, is more
likely to have an impact on the function than one in the flank
- dbname_mir
- name of the miRNA
- dbname_status
- optional field given when a status field is present in the
annotation file
If multiple miRNA genes are affected, the fields
will contain a semicolon separated list of impacts and genes.
Potential annotation elements are
- mature5p
- variant in the sequence that ends up in the mature miRNA of
the 5p arm. The location of the affected isomiR in the hairpin is
added after the mature5p, and if the variant is in the seed region
(most important region in targeting) the word "seed" is
added after the location, e.g. mature5p21_43(a+4)seed
- mature3p
- variant in the mature miRNA of the 3p arm. Same additions as
for the mature5p are present.
- loop
- variant in the loop of the hairpin. Variants outside of the
mature miRNA can affect the expression of the miRNA through changes
in secondary structure and biogenesis.
- arm5p
- 5' arm variant.
- arm3p
- 3' arm variant
- flank
- Variants up to 100 nts from the hairpin. These are still
likely to affect the biogenesis of the miRNA.
- upstream
- more than 100 nts (and less than 2000 by default) before
the hairpin. If a variant affects a miRNA gene directly (up to
flank), up/downstream annotations for other miRNA genes are not
given.
- downstream
- more than 100 nts (and less than 2000 by default) after
the hairpin
The location in the element is given by a reference (a for arm,
m for mature, l for loop) and a number indicating how many nts the
variant is located away from the reference, e.g. loop(a+2) indicates
that the variant is in the loop, 2 nts away from the arm. Negative
numbers are used to indicate counting from the opposite direction,
e.g. arm5p(m-5) is used to indicate a variant in the 5' arm 5 nts
back from the mature sequence. An e can be added to the number to
indicate that the location is at either end of the given element. A
deletion of the complete miRNA gene is indicated by the impact
GENEDEL. Deletions spanning several (but not all) elements list the
affected elements joined by &.
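The element and location notation can be taken apart with a regular expression, as in the Python sketch below. It covers only a single element such as loop(a+2), arm5p(m-5) or mature5p21_43(a+4)seed. The function name is made up, and the position of the optional e (directly after the number) is an assumption, since this help text does not show an example of it. Transcript names and elements joined by & are not handled.

```python
import re

# element name, optional isomiR location, (reference, signed offset,
# optional "e"), optional "seed"
LOCATION = re.compile(
    r"^(?P<element>[A-Za-z]+(?:[35]p)?)"
    r"(?:(?P<start>\d+)_(?P<end>\d+))?"
    r"\((?P<ref>[aml])(?P<offset>[+-]\d+)(?P<atend>e)?\)"
    r"(?P<seed>seed)?$"
)

REFERENCES = {"a": "arm", "m": "mature", "l": "loop"}


def parse_mir_location(text):
    """Parse one element annotation; returns None if it does not match."""
    match = LOCATION.match(text)
    if match is None:
        return None
    isomir = None
    if match.group("start") is not None:
        isomir = (int(match.group("start")), int(match.group("end")))
    return {
        "element": match.group("element"),
        "isomir": isomir,
        "reference": REFERENCES[match.group("ref")],
        "offset": int(match.group("offset")),
        "at_end": match.group("atend") is not None,
        "seed": match.group("seed") is not None,
    }


for example in ("loop(a+2)", "arm5p(m-5)", "mature5p21_43(a+4)seed"):
    print(example, parse_mir_location(example))
```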
Category
Annotation