Vcf2tsv

Format

cg vcf2tsv ?options? ?infile? ?outfile?

Summary

Converts data in vcf format to a genomecomb tab-separated variant file (tsv). The command also sorts the tsv appropriately.

Description

cg vcf2tsv converts a vcf file to a tab-separated variant file (tsv). The header section of the vcf is converted to the genomecomb header conventions in the tsv file. The fields describing the variant (chromosome, begin, end, type, ref, alt) follow the normal genomecomb conventions. The fields ID and QUAL in the vcf file appear in the tsv as "name" and "quality" respectively. If genotype data is present, the normal genomecomb fields (alleleSeq1, alleleSeq2, zyg, phased, genotypes) are added.

All other information in the FORMAT or INFO data is converted into extra columns in the result file. These get the ID code in the original vcf file as a name, except for the following (common) fields that get a longer (more informative) name:

AD: alleledepth (Allelic depths for the ref and alt alleles in the order listed)
GT: genotype
DP in INFO: totalcoverage (Total Depth, counting all reads)
DP in FORMAT: coverage (Read Depth, counting only filtered reads used for calling, and only from one sample)
FT: gfilter
GL: loglikelihood (three floating point log10-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic)
GQ: genoqual (genotype quality, encoded as a phred quality: -10*log10 p(genotype call is wrong))
PS: phaseset (integer indicating the haplotype set the phased genotype belongs to)
HQ: haploqual (haplotype qualities, two phred qualities comma separated)
AN: totalallelecount (total number of alleles in called genotypes)
AC: allelecount (allele count in genotypes, for each alt allele, in the same order as listed)
AF: frequency (allele frequency for each alt allele in the same order as listed)
AA: Ancestralallele
DB: dbsnp
H2: Hapmap2
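
The renaming rules in the list above can be summarised as a small lookup. The following Python sketch is only an illustration of those rules, not the code vcf2tsv itself uses; the names COMMON_NAMES, DP_NAMES and tsv_column_name are hypothetical, and the sketch covers only the base column name described above.

```python
# Illustration only: the longer names listed above, keyed by vcf ID.
# COMMON_NAMES, DP_NAMES and tsv_column_name are hypothetical names.
COMMON_NAMES = {
    "AD": "alleledepth",
    "GT": "genotype",
    "FT": "gfilter",
    "GL": "loglikelihood",
    "GQ": "genoqual",
    "PS": "phaseset",
    "HQ": "haploqual",
    "AN": "totalallelecount",
    "AC": "allelecount",
    "AF": "frequency",
    "AA": "Ancestralallele",
    "DB": "dbsnp",
    "H2": "Hapmap2",
}

# DP is the one ID whose name depends on where it was declared.
DP_NAMES = {"INFO": "totalcoverage", "FORMAT": "coverage"}


def tsv_column_name(field_id, section):
    """Return the tsv column name for a vcf field ID.

    section is "INFO" or "FORMAT". IDs that are not in the list
    keep their original vcf ID as column name.
    """
    if field_id == "DP":
        return DP_NAMES[section]
    return COMMON_NAMES.get(field_id, field_id)


print(tsv_column_name("DP", "INFO"))    # totalcoverage
print(tsv_column_name("DP", "FORMAT"))  # coverage
print(tsv_column_name("MQ", "INFO"))    # MQ (not in the list, kept as is)
```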

In vcf files, variants next to each other (e.g. a snp followed by a deletion) are described together, basically as a substitution with several alleles. By default vcf2tsv will split these up into the separate types (and alleles) and adapt the resulting variant lines accordingly as far as possible (some fields contain lists correlated with the alleles). The way this is handled is set by the -split option:

ori: Using ori for -split will recreate the original setup, creating exactly one line for each line in the vcf file. A combined variant line will be converted to a variant of type sub. For all fields the correlations with alleles will stay correct, but querying and annotation will be harder (e.g. a common snp can be missed because it is combined into a sub with an indel).
1: Each alternative allele will be on a separate line. (split version of tsv format)
0: Different types will be on separate lines, but multiple alleles of the same type are on the same line. (multiallelic version of tsv format)

The vcf fields containing lists that have to be handled specially are indicated in the vcf file with:

Number=A: These contain a list of values corresponding to the alternative alleles in the vcf. Each value will be assigned (as a single value) to its proper allele.
Number=R: These contain a list of values corresponding to all alleles, starting with the reference allele and then the alternatives. An extra field (fieldname_ref) will be created that contains the reference value, and the other values are assigned to their proper alleles.
Number=G: These fields contain a list of values for all potential genotypes. They cannot be properly split up over the individual alleles (especially as the alleles may end up as different types). They are transferred as is, but the correlation in the resulting file may be wrong.
Number=.: Unspecified; by default they are left as is, but in the results of some programs they are related to alleles, either as A or R. You can use the -typelist option to specify what to do with them.
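
As an illustration of how the Number=A and Number=R lists are distributed when each alternative allele gets its own line, here is a minimal Python sketch. It is not the vcf2tsv implementation: the function split_by_allele and its arguments are hypothetical, values are kept as strings, and the splitting of alleles into different variant types is not modelled.

```python
# Illustration only, not the vcf2tsv code: distribute one list-valued
# vcf field over the alternative alleles, one result row per allele.
def split_by_allele(name, number, values, alts):
    """Return one dict of columns per alternative allele.

    name   -- column name of the field (e.g. "alleledepth")
    number -- the Number= code from the vcf header: "A", "R", "G" or "."
    values -- list of values from the vcf line (strings)
    alts   -- list of alternative alleles from the vcf line
    """
    if number == "A" and len(values) != len(alts):
        raise ValueError("Number=A needs one value per alt allele")
    if number == "R" and len(values) != len(alts) + 1:
        raise ValueError("Number=R needs one value per allele, ref included")
    rows = []
    for i, alt in enumerate(alts):
        row = {"alt": alt}
        if number == "A":
            # one value per alt allele
            row[name] = values[i]
        elif number == "R":
            # first value belongs to the reference allele
            row[name + "_ref"] = values[0]
            row[name] = values[i + 1]
        else:
            # "G" and "." lists are copied unchanged to every row
            row[name] = ",".join(values)
        rows.append(row)
    return rows


# A vcf line with alt alleles T,G and AD=10,4,2 (ref, T, G)
for row in split_by_allele("alleledepth", "R", ["10", "4", "2"], ["T", "G"]):
    print(row)
```

In this sketch each allele row carries its own alleledepth value plus the shared alleledepth_ref value, mirroring the fieldname_ref column described above.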

Arguments

infile: file to be converted, if not given, uses stdin. File may be compressed.
outfile: write results to outfile, if not given, uses stdout

Options

-split 0/1/ori: produce a tsv with split (1) or multiallelic (0) alleles, or keep the original layout (ori)
-sort 0/1: By default (1) cg vcf2tsv will sort the file during conversion. Explicit sorting is not always needed (e.g. if the vcf is sorted and uses a natural sort order for chromosomes, or if sorting will happen later in the workflow anyway) and can be turned off using -sort 0 to save processing time.
-t typelist (-typelist): Determines what to do with fields indicated with Number=. in the vcf. The first character indicates how to handle such a field by default (R or A to distribute the values over the alleles, or . to just copy the list). This can be followed by a (space separated) list of fieldnames and how to handle each of them. (This will only be applied if the given field is specified as Number=.) The default typelist is ". AD R RPA R AC A AF A", which includes some fields that are commonly defined this way.
-keepfields fieldlist: Besides the obligatory fields, include only the fields in fieldlist (space separated) in the output. Default is to use all fields present in the file (*)
-locerror error/keep/correct: some vcfs contain locations that would be incorrect (and lead to problems for annotation etc.) in a tsv (e.g. end < begin). By default (error) vcf2tsv will stop with an error on these. Use keep to continue producing a tsv file that includes these wrong entries, or correct to produce a tsv file in which the error is corrected (end is set equal to begin).
-meta: list of key value pairs that will be added to meta data in the comment lines
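
To make the format of the -typelist value more concrete, the following Python sketch parses a typelist string and looks up how a field would be handled. It only illustrates the format described above and is not how vcf2tsv is implemented; parse_typelist and handling_for are hypothetical names.

```python
# Illustration only: interpret a -typelist value as described above.
DEFAULT_TYPELIST = ". AD R RPA R AC A AF A"


def parse_typelist(typelist=DEFAULT_TYPELIST):
    """Split a typelist into (default handling, {fieldname: handling})."""
    tokens = typelist.split()
    if not tokens:
        raise ValueError("typelist must start with a default (R, A or .)")
    default, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("typelist needs a handling code for every fieldname")
    # rest alternates fieldname, handling, fieldname, handling, ...
    per_field = dict(zip(rest[0::2], rest[1::2]))
    return default, per_field


def handling_for(field_id, declared_number, typelist=DEFAULT_TYPELIST):
    """Return the handling code (A, R, ., ...) to apply to a field.

    The typelist is only consulted when the vcf header declares the
    field as Number=. ; any other declaration is used as given.
    """
    if declared_number != ".":
        return declared_number
    default, per_field = parse_typelist(typelist)
    return per_field.get(field_id, default)


print(handling_for("AD", "."))            # R (from the default typelist)
print(handling_for("XX", "."))            # . (default: copy the list)
print(handling_for("XX", ".", "A AD R"))  # A (default changed to A)
```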
