Vcf2tsv
Format
cg vcf2tsv ?options? ?infile? ?outfile?
Summary
Converts data in vcf format to genomecomb tab-separated variant
file (tsv). The command will also sort the tsv
appropriately.
Description
cg vcf2tsv converts a vcf file to a tab-separated variant file (tsv). The header section of the vcf is converted
to the genomecomb header conventions in the tsv file. The fields
describing the variant (chromosome, begin, end, type, ref, alt) are
in the normal genomecomb conventions. The fields ID and QUAL in the
vcf file are in the tsv as "name" and "quality"
respectively. If genotype data is present, the normal genocomb fields
(alleleSeq1, alleleSeq2, zyg, phased,genotypes) are added.
All other information in the FORMAT or INFO data is converted into
extra columns in the result file. These get the ID code in the
original vcf file as a name, except for the following (common) fields
that get a longer (more informative) name:
- AD
- alleledepth (Allelic depths for the ref and alt alleles in
the order listed)
- GT
- genotype
- DP in INFO
- totalcoverage (Total Depth, counting all reads)
- DP in FORMAT
- coverage (Read Depth, counting only filtered reads used for
calling, and only from one sample)
- FT
- gfilter
- GL
- loglikelihood (three floating point log10-scaled likelihoods
for AA,AB,BB genotypes where A=ref and B=alt; not applicable if
site is not biallelic)
- GQ
- genoqual (genotype quality, encoded as a phred quality
-10log_10p(genotype call is wrong))
- PS
- phaseset (integer indicating the haplotype set the phased
genotype belongs to)
- HQ
- haploqual (haplotype qualities, two phred qualities comma
separated)
- AN
- totalallelecount (total number of alleles in called
genotypes)
- AC
- allelecount (allele count in genotypes, for each alt allele,
in the same order as listed)
- AF
- frequency (allele frequency for each alt allele in the same
order as listed)
- AA
- Ancestralallele
- DB
- dbsnp
- H2
- Hapmap2
In vcf files variants next to each other (such as e.g. a snp
followed by a deletion) are described together, basically as a
substition with several alleles. By default vcf2tsv will split these
up into the separate types (and alleles) and adapt the resulting
variant lines accordingly as far as possible (some fields contain
lists correlated with the alleles). The way of handling this is set
by the -split option
- ori
- Using ori for -split will recreate the orignal setup,
creating exactly one line for each line in the vcf file. A combined
variant line will be converted to a variant of type sub. For all
fields the correlations with alleles will stay correct, but
querying and annotation will be harder (e.g. missing a common snp
because it is combined into a sub with an indel).
- 1
- Each alternative allele will be on a seperate line. (split
version of tsv format)
- 0
- Different types will be on a seperate lines, but multiple
alleles are on the same line. (multiallelic version of tsv format)
The vcf fields containing lists that have to be handled
specially are indicated in the vcf file with:
- Number=A
- These contain a list of values corresponding to the
alternative alleles in the vcf. Each value will be assigned (as a
single value) to their proper allele.
- Number=R
- These contain a list of values corresponding to all alleles,
starting with the reference allele and then the alternatives. An
extra field (fieldname_ref) will be created that contains reference
value, and the other values are assigned to their proper allele.
- Number=G
- These fields contain a list of values for all potential
genotypes. They cannot be properly split up to the individual
alleles (especially as the alleles may end up as different types).
They are transfered as is, but the correlation in the resulting
file may be wrong.
- Number=.
- Unspecified; by default they are left as is, but in the
results of some programs they are related to alleles, either as A
or R. You can use the -typelist option to specify what to do with
them.
Arguments
- infile
- file to be converted, if not given, uses stdin. File may be
compressed.
- outfile
- write results to outfile, if not given, uses stdout
Options
- -split 0/1/ori
- produce a tsv with split (1),
multiallelic (0) alleles or keep the original layout
- -sort 0/1
- By default (1) cg vcf2tsv will sort the file during
conversion. Explicit sorting is not always needed (e.g. if the vcf
is sorted and uses a natural sort order for chromosomes, or if
sorting will happen later in the workflow anyway) and can be turned
of using -sort 0 to save processing time.
- -t typelist (-typelist)
- Determines what to do with fields indicated with Number=. in
the vcf. The first character indicates how to deal by default with
such a field (R, A, to distribute over alleles or . to just copy
the list). Following this can be a (space separated) list of
fieldnames and how to handle them. (This will only be applied if
the given field is specified as Number=.) The default typelist is
". AD R RPA R AC A AF A", including some fields which are
commonly defined this way.
- -keepfields fieldlist
- Besides the obligatory fields, include only the fields in
fieldlist (space separated) in the output. Default is to use all
fields present in the file (*)
- -locerror error/keep/correct
- some vcfs contain locations that would be incorrect (lead to
problems for annotation etc.) in a tsv (e.g. end < begin). By
default vcf2tsv will stop with an error on these (error).
Use keep to continue producing a tsv file including these
wrong entries, while correct will produce a tsv file where
this error is corrected (end changed to = begin).
- -meta
- list of key value pairs that will be added to meta data in
the comment lines
Category
Format Conversion