Introduction

GenomeComb

GenomeComb provides tools to analyze, combine, annotate and query whole genome, exome or targetted sequencing data as well as transcriptome data. Variant files in tab-separated format from different sequencing datasets can be generated and combined taking into account which regions are actually sequenced (given as region files in tab-separated format), annotated and queried (several examples can be seen in the Howto_Query section). A graphical user interface able to browse and query multi-million line tab-separated files is also included.

The cg process_project command provides a pipeline to generate annotated multisample variant data, reports, etc. starting from various raw data source material (e.g. fastq files,Complete Genomics data), which can be run locally (optionally using multiple cores) or distributed on a cluster. It combines many of the genomecomb commands that are also available separately (reference)

File formats

While genomecomb understands and produces most of the typical formats used in ngs analysis (bam/cram, vcf, bed, ...), the central, standard file format used in GenomeComb is the widely supported, simple, yet flexible tab-separated values file (format_tsv). This text format contains tabular data, where each line is a record, and each field is separated from the next by a TAB character. The first line (not starting with a #) is a header indicating the names of each column (or field). Lines starting with a # preceeding the header are comments (and may store metadata on the file). The file extension .tsv can be used to refer to this format.

Depending on which columns are present, tsv files can be used for various purposes. Usually the files are used to describe features on a reference genome sequence. In this intro a number of basic fields and uses are described. Refer to the format_tsv help for more in depth info on the format and its uses. Some typical fields are:

chromosome: chromosome name. Many genomecomb tools allow mixing chr1 and 1 notations
begin: start of feature. half-open coordinates as used by UCSC bed files and Complete Genomics files are expected. This means for instance that the first base of a sequence will be indicated by start=0 and end=1. An insertion before the first base will have start-0, end=0.
end: end of feature in half-open coordinates
type: type of variation: snp, ins, del, sub are recognised
ref: genotype of the reference sequence at the feature. For large deletions, the size of the deletion can be used.
alt: alternative genotype(s). If there are more than one alternatives, they are separated by commas.
alleleSeq1: gentype of features at one allele
alleleSeq2: gentype of features at other allele

Most tools expect the tsv files to be sorted on chromosome,begin,end,type and will create sorted files. You can sort files using the -s option of cg select. Not all the columns must be present, and any other columns can be added and searched. In files containing data for multiple samples, columns that are specific to a sample have -samplename appended to the column name. Some examples of (minimal) columns present for various genomecomb files:

region file: chromosome begin end.
variant file: chromosome begin end type ref alt ?alleleSeq1? ?alleleSeq2?
multicompar file: chromosome begin end type ref alt alleleSeq1-sample1 alleleSeq2-sample1 alleleSeq1-sample2 alleleSeq2-sample2 ...

These files can easily be queried using the cg select functionality or can be loaded into a local database.

The format does not use quoting, so values in the table cannot contain tabs or newlines, unless by coding them using escape characters (\t,\n)

Project directories

While not necessary for many of the commands, using the specific organisation of files in a Genomecomb project directory (described in projectdir) is useful: e.g. the process commands (e.g. cg process_project,cg process_sample) to run an entire analysis pipeline expect this structure to start from and generates all additional data in this structure.

How to start

In the Howto section we give some extended examples on how to process ngs data and query the results.

Home

Contact

Installation

Documentation

Introduction

GenomeComb

File formats

Project directories

How to start