GenomeComb
This text describes an example of how to use cg process_project to analyze, combine, and annotate whole genome, exome or targeted Illumina sequencing raw data using genomecomb. Starting from raw data, a fully annotated projectdir will be created that can be further queried and analyzed using cg select and cg viz.
In this howto, the small example and test data set ori_mixed_yri_mx2, downloadable from the genomecomb website, will be used. This data set was derived from publicly available exome and genome sequencing data by extracting only raw data covering the region of the MX2 gene (on chr21) and a part of the ACO2 gene (on chr22).
All analyses happen in a project directory (extensively described in projectdir). You can create a project directory and add raw data using the cg project_addsample command as described below, or you can just create the basic structure (fastq files in directories like <projectdir>/samples/<sampledir>/fastq/<fastqfile>) by other means.
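For example, a minimal sketch of setting up that basic structure by hand; the sample name sample1 and the fastq paths below are only illustrative:

# hypothetical sample "sample1" with existing fastq files in /data/run1
mkdir -p tmp/mixed_yri_mx2/samples/sample1/fastq
ln -s /data/run1/sample1_R1.fq.gz /data/run1/sample1_R2.fq.gz \
    tmp/mixed_yri_mx2/samples/sample1/fastq/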
We start by creating and populating a projectdir (tmp/mixed_yri_mx2) with samples using the cg project_addsample command. (The projectdir will be made with the first addsample call if it does not yet exist.)
# get the example data
wget http://genomecomb.bioinf.be/download/dataset_mixed_yri_mx2.tar.gz
tar xvzf dataset_mixed_yri_mx2.tar.gz
cd dataset_mixed_yri_mx2

# Create the projectdir
# This is not strictly needed:
# it would be made with the first addsample call if it did not yet exist
mkdir -p tmp/mixed_yri_mx2

# Add a Complete Genomics sample named cgNA19240mx2.
# ori/mixed_yri_mx2/cgNA19240mx2 contains an ASM directory as obtained from Complete Genomics.
# By default project_addsample creates softlinks to the data files in the original directory.
cg project_addsample tmp/mixed_yri_mx2 cgNA19240mx2 ori/mixed_yri_mx2/cgNA19240mx2

# Add Illumina fastq files from genome sequencing, using wildcards to find all fastq files
cg project_addsample tmp/mixed_yri_mx2 gilNA19240mx2 ori/mixed_yri_mx2/gilNA19240mx2/*.fq.gz

# Add Illumina fastq files from exome sequencing.
# The option -targetfile gives a region file with the target regions for this sample.
# (The \ at the end of the first line means that the command is continued on the next line.)
cg project_addsample -targetfile ori/mixed_yri_mx2/reg_hg19_exome_SeqCap_EZ_v3.tsv.lz4 \
    tmp/mixed_yri_mx2 exNA19239mx2 ori/mixed_yri_mx2/exNA19239mx2/*.fq.gz

# Add another exome sequencing sample.
# Here we do not give the individual fastq files as arguments, but the directory containing them.
# You can only do this if the directory does not contain fastq files from other samples.
cg project_addsample -targetfile ori/mixed_yri_mx2/reg_hg19_exome_SeqCap_EZ_v3.tsv.lz4 \
    tmp/mixed_yri_mx2 exNA19240mx2 ori/mixed_yri_mx2/exNA19240mx2
The cg process_project command will run the entire secondary analysis (clipping, alignment, variant calling, reports, ...) and part of the tertiary analysis (combining samples, annotation, ...) on the samples in the projectdir. Reference information (genome sequence, annotation files, etc.) is in /complgen/refseq/hg19.
cg process_project -verbose 2 -d 2 \
    -paired 1 -clip 1 \
    -aligner 'bwa' \
    -removeduplicates 1 \
    -realign 1 \
    -distrreg 0 \
    -aliformat cram \
    -varcallers 'gatkh strelka' \
    -svcallers 'manta' \
    -refdir /complgen/refseq/hg19 \
    tmp/mixed_yri_mx2
If the command is interrupted, processing can continue from where it left off by repeating the same command.
Processing can be distributed over more than one processing unit: the -d 2 parameter in the example distributes the processing over 2 cores on the local machine. If you have a cluster, you can distribute over the cluster using -d sge (for Sun Grid Engine) or -d slurm (for SLURM).
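For example, the run above could be submitted to a Sun Grid Engine cluster by changing only the -d option; this sketch assumes the projectdir and reference directory are on a filesystem shared by the cluster nodes:

# same analysis as above, but jobs are submitted to an SGE cluster instead of local cores
cg process_project -verbose 2 -d sge \
    -paired 1 -clip 1 \
    -aligner 'bwa' \
    -removeduplicates 1 \
    -realign 1 \
    -distrreg 0 \
    -aliformat cram \
    -varcallers 'gatkh strelka' \
    -svcallers 'manta' \
    -refdir /complgen/refseq/hg19 \
    tmp/mixed_yri_mx2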
The -distrreg option is set to 0 because for such a small data set (covering only a small part of the genome) more extensive distribution would actually create more work (overhead) than it saves (especially with only 2 cores to work with).
Errors in the submission command (e.g. the given reference dir does not exist) are returned immediately. You can get more extensive information on such errors by adding the -stack 1 option (and possibly the option -v 2). When running distributed (option -d with sge, slurm, or a number of cores), cg process_project can also encounter errors in the submitted jobs. Information on submitted jobs is gathered in a directory log_jobs (one file per job) and in a tsv log file typically named process_project*.<ext>, where <ext> reflects the status of the run: running while the analysis is still going, finished when it completed successfully, or error if some jobs failed (see below).
If there was an error in one job, all jobs that depend on the results of that job will also have errors (dependencies that are not found), so you typically want to look for the first error. You can do this by checking/querying the (tsv) log file. The convenience command cg error_report can be used to get a more nicely formatted overview of the errors (if you do not specify the logfile, it will take the most recent one in the current working directory):
cg error_report ?logfile?
In this output you can check the error messages, run times, etc. With the runfile given in this output, you can try to run specific jobs separately.
All analysis results are added to the projectdir, as described in projectdir. You can check the process_project_*.finished file (tsv) for information on how long jobs took, etc. If there was an error, the file will be called process_project_*.error instead and contains info on the error(s).
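Since this log file is a tsv, it can also be inspected with cg select. A minimal sketch; the field names used in the second command (job, status, duration) are assumptions, so check the actual header first with -h:

# list the fields available in the (tsv) log file
# (for a failed run, use process_project_*.error instead)
cg select -h process_project_*.finished
# show per-job information; adapt the field names to the header shown by the previous command
cg select -f 'job status duration' process_project_*.finished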
cg viz provides an easy view of the results through a graphical interface.
cg viz tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst
opens the annotated combined variant file (format described in format_tsv) using cg viz, allowing you to browse through the table (even if it is millions of lines long). Look in howto_view for some examples of what to do in cg viz.
cg select can be used to query the result files from the command line: selecting or adding fields, selecting variants that fulfil given criteria, or making summaries (the functionality is similar to cg viz, with a few extras). As such it can be integrated more easily into other workflows, and exact logging of what was done is easier than with cg viz.
Some examples are shown here; you can find more examples in howto_query and an extensive description of the possibilities in the cg select help.
# Which fields are in the file; results are shown on the terminal (stdout)
cg select -h tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst

# Which "samples" are in the file; results are shown on the terminal (stdout)
cg select -n tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst

# Write a file tmp/short.tsv containing only the basic variant fields
cg select -f 'chromosome begin end type ref alt' \
    tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst tmp/short.tsv

# Add a field count that calculates in how many of the "NA19240 samples" zyg-gatkh-rdsbwa
# (zygosity call by gatk haplotype caller) is either t or m
cg select -overwrite 1 \
    -f 'chromosome begin end type ref alt {count=scount($zyg-gatkh-rdsbwa in "t m" and $sample regexp "NA19240")}' \
    tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst tmp/short.tsv

# Select only variants where gatk has made a homozygous call in sample gilNA19240mx2
# The results are compressed using "piping":
# The results of cg select are transferred (indicated by the | or pipe character) to "cg zst"
# cg zst compresses the data.
# The > character redirects the results of the compression to the file tmp/homozygous.tsv.zst
cg select -q '$zyg-gatkh-rdsbwa-gilNA19240mx2 == "m"' \
    tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst | cg zst > tmp/homozygous.tsv.zst

# intGene_impact contains the impact of the variant on all alternative splice variants in the intGene
# gene dataset. intGene_impact contains a list of impacts (so we cannot use ==).
# You can also query using regular expressions. The following selects only variants with
# impact matching the regexp pattern "UTR|RNA": containing UTR or RNA
cg select -q '$intGene_impact regexp "UTR|RNA"' \
    tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst tmp/utr.tsv

# You can also properly check the list vs a given list of impacts
# (getting the correct list may be easier in cg viz using Easyquery)
# shares is true if the list in intGene_impact shares elements with the given list
cg select \
    -q '$intGene_impact shares {RNASTART UTR5KOZAK UTR5SPLICE UTR5 RNAEND UTR3SPLICE UTR3 RNASPLICE RNA}' \
    tmp/mixed_yri_mx2/compar/annot_compar-mixed_yri_mx2.tsv.zst tmp/utr2.tsv
The following is an example methods text describing the default process_project workflow for Illumina sequencing with the proper references (the <version> actually used for each tool can be found in the analysisinfo files):
Analysis was performed in-house with a standardized pipeline integrated in genomecomb (1). The pipeline used fastq-mcf (2) for adapter clipping. Reads were aligned to the hg38 genome reference (3) using bwa mem (4) and the resulting sam file converted to bam using samtools (5). Bam files were sorted and duplicates were removed using samtools markdup (5). Realignment in the neighborhood of indels was performed with GATK (6). Variants were called at all positions with a totalcoverage >= 5 using both GATK haplotype caller (7) and strelka (8). At the initial stage positions with a coverage < 8 or a genotype quality score < 25 were considered unsequenced. Structural variants were called using manta (9) and lumpy (10). The resulting variant sets of different individuals were combined and annotated using genomecomb (1).

(1) genomecomb <version>, Reumers, J*, De Rijk, P*, Zhao, H, Liekens, A, Smeets, D, Cleary, J, Van Loo, P, Van Den Bossche, M, Catthoor, K, Sabbe, B, Despierre, E, Vergote, I, Hilbush, B, Lambrechts, D and Del-Favero, J (2011) Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Nature biotechnology, 30, 61-88 [PMID: 22178994]
(2) fastq-mcf <version>, Erik Aronesty (2011). ea-utils: "Command-line tools for processing biological sequencing data"; Expression Analysis, Durham, NC http://code.google.com/p/ea-utils
(3) hg38 (GRCh38) downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz on 2020-06-06, Schneider V.A. et al. (2017) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017 May; 27(5): 849-864. [PMID: 28396521]
(4) bwa <version>, Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
(5) samtools <version>, Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
(6) GATK 3.8-1-0-gf15c1c3ef, DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. NATURE GENETICS 43:491-498 [PMID: 21478889]
(7) GATK <version>, DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. NATURE GENETICS 43:491-498 [PMID: 21478889]
(8) strelka <version>, Kim S. et al. (2018) Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods. 15(8):591-594. [PMID: 30013048]
(9) manta <version>, Chen, X. et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32, 1220-1222 [PMID: 26647377]
(10) lumpy <version>, Layer M.R. et al. (2014) LUMPY: a probabilistic framework for structural variant discovery. Genome biology, 15(6), R84 [PMID: 24970577]
Adapt this according to the analysis tools used. The actual versions used can be found in the analysisinfo files. In this example we did not run lumpy (which gives an error on such a small data set), so it should be removed from the text. If other tools were used, add them to the text and the references.
For amplicon sequencing, the duplicate removal (samtools markdup) part should be removed, and the following sentence added after the part about realignment:
Amplicon primers were clipped using genomecomb (1).
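On the processing side, such an amplicon run could look like the sketch below. The -amplicons option (pointing to a region file with the amplicon/primer coordinates), the region file name and the projectdir used here are assumptions for illustration; check cg process_project -h for the options available in your genomecomb version.

# sketch of an amplicon run: duplicates are not removed, and an (assumed) -amplicons
# option provides the amplicon/primer regions used for primer clipping
cg process_project -verbose 2 -d 2 \
    -paired 1 -clip 1 \
    -aligner 'bwa' \
    -removeduplicates 0 \
    -realign 1 \
    -amplicons ori/mixed_yri_mx2/reg_hg19_amplicons.tsv.lz4 \
    -varcallers 'gatkh strelka' \
    -refdir /complgen/refseq/hg19 \
    tmp/amplicon_project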