Pmulticompar
Format
cg pmulticompar ?options? multicomparfile sampledir/varfile ...
Summary
Compare multiple variant files
Description
This command combines multiple variant files into one
multicompar file: a tab separated file containing a wide table
used to compare variants between different samples. Each line in the
table contains general variant info (position, type, annotation,
etc.) and columns with variant info specific to each sample
(genotype, coverage, etc.). The latter have a column heading of the
form field-<sample>.
The <sample> name used is extracted from the filename
of the variant files by removing the file extension and the start of
the filename up to and including the first -. (Variant files are
expected to use the following convention: var-<sample>.tsv.)
<sample> can be simply the samplename, but may also include
information about e.g. the sequencing or analysis method in the form
method-samplename, e.g. var-gatk-sample1.tsv and var-sam-sample1.tsv
to indicate variants called using gatk and samtools respectively for
the same sample (sample1), thus allowing comparison of different
methods, etc.
Completing missing information
When some samples contain a variant that others do not, the
information in the multicompar file cannot be complete by combining
variant files alone: A variant missing in one variant file may either
mean that it is reference or that it is not sequenced. Other
information about coverage, etc. on the variant not present in the
variant file will also be missing. multicompar will try to
complete this missing info based on files (if present)
accompanying the variant file (in the same directory and having a
similar filename). Folowing files can be used for completing
multicompar data:
- sreg-<sample>.tsv: a tab-separated region file indicating
all regions that are sequenced. If the variant is located in such a
region, the sequenced-sample value will be set to r (for
reference). Outside of these regions it will be annotated as u
(unsequenced).
- varall-<sample>.tsv: an tsv file made by returning
"variant" calling results for all positions. This must
contain the same columns as the variant file. All fields can be
completed from a varall file. If a variant is not in the varall
file, it is set to u (unsequenced).
If data can not be completed using accompanying files, the
missing information will be indicated by a questing mark as a value
in the table.
Adding to an existing multicomnpar
Using the same command, you can add new samples to an existing
multicompar file. pmulticompar needs the original variant files and
accompanying files for the existing multicompar file. It can find
these files (based on filename) easily if everything is organised in
a projectdir (hint: You can add samples
in another projectdir without duplication using symbolic links)
pmulticompar will look however in various places relative to the
multicompar file (same directory, samples directory in same dir,
parent directory, samples directory in parent dir).
compared to old multicompar
cg pmulticompar is a parallel and much faster version of multicompar. It also integrates the
reannot step in one. It is more finicky about e.g. directory
structure (needs to find the original sample variant files) and does
not support all accompanying information files that the old one does.
Arguments
- multicomparfile
- resultfile, will be created if it does not exist
- sampledir/varfile
- directory or file containing variants of new sample to be
added. More than one can added in one command If a sample directory
is given, all files in it of the format var-<sample>.tsv (or
fannotvar-<sample>.tsv) will be added as variant files.
Options
- -r 0/1 (--reannotregonly)
- Also do reannotation, but only region data (sreg) will be
updated
- -t variantfile (--targetvarsfile)
- All variants in variantfile will be in the final
multicompar, even if they are not present in any of the samples. A
column will be added to indicate for each variant if it was in the
targetvarsfile. The name of the column will be the part of the
filename after the last dash. If variantfile contains a
"name" column, the content of this will be used in the
targets column instead of a 1.
- -s 0/1 (-split --split)
- if 1 (default), multiple alternative alleles will be on a
separate lines, treated mostly as a separate variant Use 0 for
giving alternatives alleles on the same line, separated by comma in
the alt field
- -e 0/1 (--erroronduplicates)
- if 1, an error is given if one of the samples to be added is
alread present in the multicompar file. The default (0) is to skip
these.
- -i 0/1 (--skipincomplete)
- if set to 0, pmulticompar will stop with an error if no sreg
or varall file is found for a sample. If 1 (default), it will give
a warning, but continue making the multicompar file (incomplete for
this sample)
- -keepfields fieldlist
- Besides the obligatory fields, include only the fields in
fieldlist (space separated) in the resulting multicompar. Default
is to use all fields present in the file (*)
- -m maxopenfiles (-maxopenfiles)
- The number of files that a program can keep open at the same
time is limited. pmulticompar will distribute the subtasks thus,
that the number of files open at the same time stays below this
number. With this option, the maximum number of open files can be
set manually (if the program e.g. does not deduce the proper limit,
or you want to affect the distribution).
- -limitreg regionfile
- limit the processing to only variants overlapping the regions
given in regionfile
- -force 0/1
- pmulticompar creates a lot of temporary files (in the .temp
dir). Even if some of these exist their recreation is forced by
default (for safety). If a pmulticompar was interupted, you can use
-force 0 to reuse the temp files already used instead of recreating
them.
This command can be distributed on a cluster or using multiple
with job options (more info with cg
help joboptions)
Category
Variants