Var

Format

cg iso_isoquant ?options? bamfile ?resultfile?

Summary

Call and count isoforms and genes using isoquant.

Description

The command can be used to call and count isoforms and genes using a method based on isoquant (https://github.com/ablab/IsoQuant).

Following adaptations are done when running isoquant within genomecomb

running is (can be) heavily parallelised in a way that distribution over a cluster is also possible
Known transcripts and novel predictions/models are merged into one result
The main output is in the transcript format.
Different types of counting are available from one run ; iq_all: isoquant standard output for the all option ; weighed: reads supporting multiple (N) transcripts are weighed as 1/N ; unique: count only reads uniquely supporting one transcript ; strict: only unique reads that cover >= 90% of the transcript ; aweighed, aunique, astrict: same as above, but reads must have polyA (detected)

By default (if resultfile is not given) the names of the resultfiles are derived from the bam file name. For a bamfile map-$root.bam (where $root is the name of the sample and analysis), the main result file will in the same directory as the bam file and named isoform_counts-isoquant-$root.tsv (or isoform_counts-isoquant_$preset-$root.tsv if a preset was specified) The gene_counts and read_assignments files are named similarly (gene_counts-isoquant-$root.tsv and read_assignments-isoquant-$root.tsv).

Arguments

bamfile: alignment on which to call isoforms
resultfile: resulting isoform count file instead of default based on bam file name

Options

-preset preset: select one of a number of presets (default ont) Possible options are ont: optimal settings for ONT reads (this is the defaul) sens: sensitive settings for ONT reads, more transcripts are reported possibly at a cost of precision all: reports almost all novel transcripts, loses precision in favor to recall (is also slower) pacbio: optimal settings for PacBio CCS reads (default_pacbio in isoquant) pacbiosens: sensitive settings for PacBio CCS reads, more transcripts are reported possibly at a cost of precision (sensitive_pacbio in isoquant) pacbioall: reports almost all novel transcripts (for pacbio), loses precision in favor to recall (is also slower) assembly: optimal settings for a transcriptome assembly: input sequences are considered to be reliable and each transcript to be represented only once, so abundance is not considered
-distrreg: distribute regions for parallel processing (default g). Possible options are 0: no distribution (also empty) 1: default distribution schr or schromosome: each chromosome processed separately chr or chromosome: each chromosome processed separately, except the unsorted, etc. with a _ in the name that will be combined), a number: distribution into regions of this size a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches) a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions a number preceded by an g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes g: distribution into regions as given in the <refdir>/extra/reg_*_distrg.tsv file; if this does not exist uses g5000000 a file name: the regions in the file will be used for distribution
-addumis 0/1: if 1, add umicount fileds to the result tables. The reads in the bam file must have umi info incorporated in te read_id (as umi#readname)
-refseq file
-reftranscripts file
-transcript_quantification string
-gene_quantification string
-data_type string
-splice_correction_strategy string
-model_construction_strategy string
-matching_strategy string
-threads number
-skip list
-regions list
-skipregions list
-cleanup 0/1

Results

read_asignments

The read_assignment files returns for each read where it aligns and which genes/isoforms it supports. For reads that could have come from multiple isoforms/genes, multiple lines are present (one for each isoform/gene). In this case the ambiguity and gambiguity fields indicate how many isoforms and genes are supported.

read_id
chromosome
begin
end
strand
exonStarts
exonEnds
aligned_size
isoform_id
gene_id
assignment_type: how does the read match the assigned transcript (unique, inconsistent, ..)
assignment_events: lists noted differences with the asigned transcript
inconsistency: level of inconsistency of the read with the assigned transcript based on assignment_events (reads with inconsistency >= 2 are not counted for an isoform) 0: unique or no inconsistencies of note 1: minor differences such as probable alignment artifacts, alternative transcription start / end 2: (incomplete) intron retentions 3: major inconsistencies (major_exon_elongation,alternative_structure,alt_donor_site,alt_acceptor_site, ...)
additional_info
ambiguity: number of isoforms the read supports (could come from), isoforms from different genes are included
gambiguity: number of genes the read supports (< ambiguity: multiple isoforms from one gene are counted as one)
covered_pct: size of the (aligned) read vs the total size of the isoform (in percent)
polya: "True" if a polyA was detected in the read, "False" if no polyA was detected on the read (empty if read does not support any isoform)
classification
closest_known
cellbarcode
umi
umicount: number of reads with the same umi (and barcode)

Home

Contact

Installation

Documentation

Var