GenomeComb
GenomeComb provides tools to analyze, combine, annotate and query whole genome, exome or targeted sequencing data as well as transcriptome data. Variant files in tab-separated format from different sequencing datasets can be generated, combined (taking into account which regions were actually sequenced, given as region files in tab-separated format), annotated and queried; several examples can be seen in the Howto_Query section. A graphical user interface able to browse and query multi-million line tab-separated files is also included.
The cg process_project command provides a pipeline that generates annotated multisample variant data, reports, etc. starting from various raw data sources (e.g. fastq files, Complete Genomics data). It can be run locally (optionally using multiple cores) or distributed on a cluster. It combines many of the genomecomb commands that are also available separately (reference).
While genomecomb understands and produces most of the typical formats used in NGS analysis (bam/cram, vcf, bed, ...), the central, standard file format used in GenomeComb is the widely supported, simple, yet flexible tab-separated values file (format_tsv). This text format contains tabular data, where each line is a record and each field is separated from the next by a TAB character. The first line not starting with a # is a header giving the name of each column (or field). Lines starting with a # that precede the header are comments (and may store metadata about the file). The file extension .tsv can be used to refer to this format.
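To make the layout concrete, here is a minimal Python sketch of a reader that follows the description above: leading # lines are collected as comments, the first other line is taken as the header, and every later line becomes a record. The function name read_tsv, the comment text and the example values are illustrative assumptions and are not part of genomecomb.

```python
import io


def read_tsv(handle):
    """Return (comments, header, records) from a tab-separated text stream.

    Hypothetical helper for illustration only; not a genomecomb API.
    """
    comments = []
    header = None
    records = []
    for raw in handle:
        line = raw.rstrip("\n")
        if header is None:
            if line.startswith("#"):
                # Comment lines are only recognised before the header.
                comments.append(line)
            else:
                # First line not starting with '#': the column names.
                header = line.split("\t")
            continue
        # Every line after the header is one record, one field per column.
        records.append(dict(zip(header, line.split("\t"))))
    return comments, header, records


# Made-up example content (values are assumptions for illustration).
example = io.StringIO(
    "# hypothetical comment line\n"
    "chromosome\tbegin\tend\ttype\n"
    "chr1\t100\t101\tsnp\n"
    "chr2\t200\t205\tdel\n"
)
comments, header, records = read_tsv(example)
```

All field values are kept as strings here; converting begin and end to integers is left to the caller.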
Depending on which columns are present, tsv files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. This intro describes a number of basic fields and uses. Refer to the format_tsv help for more in-depth information on the format and its uses. Some typical fields are:
Most tools expect the tsv files to be sorted on chromosome, begin, end, type and will create sorted files. You can sort files using the -s option of cg select. Not all of these columns need to be present, and any other columns can be added and searched. In files containing data for multiple samples, columns that are specific to a sample have -samplename appended to the column name. Some examples of (minimal) columns present for various genomecomb files:
These files can easily be queried using the cg select functionality or can be loaded into a local database.
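The sort order and the -samplename column convention described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the chromosome is compared as a plain string (the ordering of chromosome names actually used by the genomecomb tools may differ), and the field name quality and the sample names sample1 and sample2 are hypothetical.

```python
def sort_key(record):
    """Key for ordering records on chromosome, begin, end, type.

    Assumption: chromosome is compared as a plain string; the real tools
    may order chromosome names differently.
    """
    return (
        record["chromosome"],
        int(record["begin"]),
        int(record["end"]),
        record["type"],
    )


def sample_columns(header, sample):
    """Return the columns specific to one sample (names ending in -sample)."""
    suffix = "-" + sample
    return [name for name in header if name.endswith(suffix)]


# Made-up records in unsorted order.
rows = [
    {"chromosome": "chr2", "begin": "50", "end": "51", "type": "snp"},
    {"chromosome": "chr1", "begin": "200", "end": "201", "type": "snp"},
    {"chromosome": "chr1", "begin": "100", "end": "103", "type": "del"},
    {"chromosome": "chr1", "begin": "100", "end": "101", "type": "snp"},
]
sorted_rows = sorted(rows, key=sort_key)

# Hypothetical multi-sample header: shared columns plus per-sample columns.
header = ["chromosome", "begin", "end", "type", "quality-sample1", "quality-sample2"]
sample1_columns = sample_columns(header, "sample1")
```

Matching on a known sample name, rather than splitting on the last hyphen, avoids misreading sample names that themselves contain a hyphen.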
The format does not use quoting, so values in the table cannot contain tabs or newlines unless they are encoded using escape sequences (\t, \n).
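As a rough illustration of this escaping, the Python sketch below replaces literal tabs and newlines in a value with the two-character sequences \t and \n before writing, and reverses the substitution on reading. The function names are hypothetical. Only the two sequences named above are handled; a value that already contains a backslash followed by t or n would not round-trip, and how genomecomb itself treats that case is not covered here.

```python
def encode_field(value):
    """Replace literal tab/newline characters with the sequences \\t and \\n."""
    return value.replace("\t", "\\t").replace("\n", "\\n")


def decode_field(text):
    """Reverse encode_field: turn the sequences \\t and \\n back into characters."""
    return text.replace("\\t", "\t").replace("\\n", "\n")


original = "first line\nsecond\tline"
encoded = encode_field(original)  # safe to store in a single tsv field
decoded = decode_field(encoded)
```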
While not required for many of the commands, organising files in a Genomecomb project directory (described in projectdir) is useful: the process commands that run an entire analysis pipeline (e.g. cg process_project, cg process_sample) expect this structure as their starting point and generate all additional data within it.
In the Howto section we give some extended examples of how to process NGS data and query the results.