GenomeComb

Pmulticompar

Format

cg pmulticompar ?options? multicomparfile sampledir/varfile ...

Summary

Compare multiple variant files

Description

This command combines multiple variant files into one multicompar file: a tab separated file containing a wide table used to compare variants between different samples. Each line in the table contains general variant info (position, type, annotation, etc.) and columns with variant info specific to each sample (genotype, coverage, etc.). The latter have a column heading of the form field-<sample>.

The <sample> name used is extracted from the filename of the variant files by removing the file extension and the start of the filename up to and including the first -. (Variant files are expected to use the following convention: var-<sample>.tsv.) <sample> can be simply the samplename, but may also include information about e.g. the sequencing or analysis method in the form method-samplename, e.g. var-gatk-sample1.tsv and var-sam-sample1.tsv to indicate variants called using gatk and samtools respectively for the same sample (sample1), thus allowing comparison of different methods, etc.

Completing missing information

When some samples contain a variant that others do not, the information in the multicompar file cannot be complete by combining variant files alone: A variant missing in one variant file may either mean that it is reference or that it is not sequenced. Other information about coverage, etc. on the variant not present in the variant file will also be missing. multicompar will try to complete this missing info based on files (if present) accompanying the variant file (in the same directory and having a similar filename). Folowing files can be used for completing multicompar data:

If data can not be completed using accompanying files, the missing information will be indicated by a questing mark as a value in the table.

Adding to an existing multicomnpar

Using the same command, you can add new samples to an existing multicompar file. pmulticompar needs the original variant files and accompanying files for the existing multicompar file. It can find these files (based on filename) easily if everything is organised in a projectdir (hint: You can add samples in another projectdir without duplication using symbolic links) pmulticompar will look however in various places relative to the multicompar file (same directory, samples directory in same dir, parent directory, samples directory in parent dir).

compared to old multicompar

cg pmulticompar is a parallel and much faster version of multicompar. It also integrates the reannot step in one. It is more finicky about e.g. directory structure (needs to find the original sample variant files) and does not support all accompanying information files that the old one does.

Arguments

multicomparfile
resultfile, will be created if it does not exist
sampledir/varfile
directory or file containing variants of new sample to be added. More than one can added in one command If a sample directory is given, all files in it of the format var-<sample>.tsv (or fannotvar-<sample>.tsv) will be added as variant files.

Options

-r 0/1 (--reannotregonly)
Also do reannotation, but only region data (sreg) will be updated
-t variantfile (--targetvarsfile)
All variants in variantfile will be in the final multicompar, even if they are not present in any of the samples. A column will be added to indicate for each variant if it was in the targetvarsfile. The name of the column will be the part of the filename after the last dash. If variantfile contains a "name" column, the content of this will be used in the targets column instead of a 1.
-s 0/1 (-split --split)
if 1 (default), multiple alternative alleles will be on a separate lines, treated mostly as a separate variant Use 0 for giving alternatives alleles on the same line, separated by comma in the alt field
-e 0/1 (--erroronduplicates)
if 1, an error is given if one of the samples to be added is alread present in the multicompar file. The default (0) is to skip these.
-i 0/1 (--skipincomplete)
if set to 0, pmulticompar will stop with an error if no sreg or varall file is found for a sample. If 1 (default), it will give a warning, but continue making the multicompar file (incomplete for this sample)
-keepfields fieldlist
Besides the obligatory fields, include only the fields in fieldlist (space separated) in the resulting multicompar. Default is to use all fields present in the file (*)
-m maxopenfiles (-maxopenfiles)
The number of files that a program can keep open at the same time is limited. pmulticompar will distribute the subtasks thus, that the number of files open at the same time stays below this number. With this option, the maximum number of open files can be set manually (if the program e.g. does not deduce the proper limit, or you want to affect the distribution).
-limitreg regionfile
limit the processing to only variants overlapping the regions given in regionfile
-force 0/1
pmulticompar creates a lot of temporary files (in the .temp dir). Even if some of these exist their recreation is forced by default (for safety). If a pmulticompar was interupted, you can use -force 0 to reuse the temp files already used instead of recreating them.

This command can be distributed on a cluster or using multiple with job options (more info with cg help joboptions)

Category

Variants