long

Format

cg long ?infile? ?outfile?

Summary

Converts a tsv file from wide format (data for each sample/analysis in separate columns) to long format (data for each sample/analysis on separate lines).

Description

Genomecomb usually expects data in tsv files in wide format, where some fields are common to all samples (e.g. chromosome, begin, end, ... for variants) and other fields have a separate value for each sample (e.g. quality, coverage, ...). In wide format, the sample-specific columns are named by appending "-samplename" to the field name. In long format, the sample-specific data is on separate lines, one line per sample, and the common fields are repeated on each line. "cg long" converts from wide to long format; "cg wide" can be used for the reverse.
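
As a rough illustration of the conversion described above, the following Python sketch turns a small wide table into long format. It is not genomecomb's implementation: the function name wide_to_long is hypothetical, the example columns and values are made up, and it assumes that every column whose name contains a - is sample specific, with the sample name being the part after the last -.

```python
def wide_to_long(header, rows):
    """Convert a wide table (column names + rows) to long format.

    Illustrative sketch only, not genomecomb code.
    Assumption: a column is sample specific if its name contains "-",
    and the sample name is the part after the last "-".
    """
    common = [i for i, name in enumerate(header) if "-" not in name]
    per_sample = {}  # sample -> {field name: column index}
    fields = []      # sample-specific field names, in order of first appearance
    for i, name in enumerate(header):
        if "-" in name:
            field, sample = name.rsplit("-", 1)
            per_sample.setdefault(sample, {})[field] = i
            if field not in fields:
                fields.append(field)

    long_header = [header[i] for i in common] + ["sample"] + fields
    long_rows = []
    for row in rows:
        for sample, cols in per_sample.items():
            # A field missing for this sample is left empty
            values = [row[cols[f]] if f in cols else "" for f in fields]
            long_rows.append([row[i] for i in common] + [sample] + values)
    return long_header, long_rows


# Made-up example data
header = ["chromosome", "begin", "end",
          "quality-sample1", "coverage-sample1",
          "quality-sample2", "coverage-sample2"]
rows = [["chr1", "100", "101", "50", "20", "45", "18"]]

long_header, long_rows = wide_to_long(header, rows)
print("\t".join(long_header))
for long_row in long_rows:
    print("\t".join(long_row))
```

Each input line yields one output line per sample, with the common fields (chromosome, begin, end) repeated on every line.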

Arguments

infile: file to be converted; if not given, stdin is used. The file may be compressed.
outfile: file to write the results to; if not given, stdout is used.

Options

-type sample/analysis: determines whether lines are split per sample (the part of the column name after the last -) or per analysis (the part after the first -). For example, take the following wide table:

var     zyg-gatk-bwa-sample1    zyg-sam-bwa-sample1     zyg-gatk-bwa-sample2
test    v                       u                       r

Using -type sample, this is split into:

var     sample      zyg-gatk-bwa    zyg-sam-bwa
test    sample1     v               u
test    sample2     r

Using -type analysis, this is split into:

var     sample              zyg
test    gatk-bwa-sample1    v
test    sam-bwa-sample1     u
test    gatk-bwa-sample2    r

Note that the column is still named "sample" even when splitting on analysis, so that sampleinfo files can still be used.
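
The difference between the two -type values comes down to where a column name is cut. The Python sketch below shows this on the column names from the example above. The helper split_column is hypothetical and not part of genomecomb; it assumes every column name passed to it contains at least one -.

```python
def split_column(name, split_type="sample"):
    """Split a wide column name into (field, sample) parts.

    Hypothetical helper for illustration, not part of genomecomb.
    Assumption: name contains at least one "-".
    """
    if split_type == "sample":
        # sample = part after the LAST "-"
        field, sample = name.rsplit("-", 1)
    elif split_type == "analysis":
        # analysis = part after the FIRST "-"
        field, sample = name.split("-", 1)
    else:
        raise ValueError("split_type must be 'sample' or 'analysis'")
    return field, sample


wide_columns = ["zyg-gatk-bwa-sample1", "zyg-sam-bwa-sample1", "zyg-gatk-bwa-sample2"]
for column in wide_columns:
    print(column,
          split_column(column, "sample"),
          split_column(column, "analysis"))
```

With "sample", the field name keeps the analysis part (zyg-gatk-bwa); with "analysis", the field name is just zyg and everything else ends up in the sample column.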

-norm 1/0: if 1, output "normalized" long tables. The normal output then contains only the variant data with a numeric id, without the sample-specific data. An extra tsv file (<outfile>.sampledata.tsv) is generated that contains the sample-specific data, one line per variant (identified by the id) and sample (identified by a "sample" column).
-samplefields samplefields: sample names that consist of several parts separated by - (as is typical in genomecomb output, e.g. -gatk-rdsbwa-sample1) are split into separate fields, using the field names given in samplefields.
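
To make the effect of -samplefields more concrete, here is a minimal Python sketch of splitting a multi-part sample name into named fields. It is an illustration under assumptions, not genomecomb's code: the helper split_sample_fields and the field names varcaller, aligner and sample are hypothetical, and letting the last field absorb any remaining - separated parts is a choice made for this sketch only.

```python
def split_sample_fields(sample_name, samplefields):
    """Map the "-" separated parts of a sample name onto field names.

    Hypothetical helper for illustration, not part of genomecomb.
    Assumption: the last field takes everything that is left over, so a
    sample name that itself contains "-" stays in one piece.
    """
    parts = sample_name.split("-", len(samplefields) - 1)
    if len(parts) != len(samplefields):
        raise ValueError(
            f"{sample_name!r} does not have {len(samplefields)} parts"
        )
    return dict(zip(samplefields, parts))


# Field names below are made up for this example
fields = split_sample_fields("gatk-rdsbwa-sample1",
                             ["varcaller", "aligner", "sample"])
print(fields)
```

In a long table, each of these fields would then appear as its own column instead of a single combined sample column.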
