GenomeComb

Var

Format

cg iso_isoquant ?options? bamfile ?resultfile?

Summary

Call and count isoforms and genes using isoquant.

Description

The command can be used to call and count isoforms and genes using a method based on isoquant (https://github.com/ablab/IsoQuant).

The following adaptations are made when running isoquant within genomecomb:

By default (if resultfile is not given), the names of the result files are derived from the bam file name. For a bamfile map-$root.bam (where $root is the name of the sample and analysis), the main result file will be in the same directory as the bam file and named isoform_counts-isoquant-$root.tsv (or isoform_counts-isoquant_$preset-$root.tsv if a preset was specified). The gene_counts and read_assignments files are named similarly (gene_counts-isoquant-$root.tsv and read_assignments-isoquant-$root.tsv).
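The naming rule above can be sketched in Python as follows. This is only an illustration, not genomecomb code: the function name default_result_files is hypothetical, and applying the _$preset suffix to the gene_counts and read_assignments files is an assumption (the text states it explicitly only for the isoform_counts file).

```python
from pathlib import Path


def default_result_files(bamfile, preset=None):
    """Derive the default result file names for a bam file named map-$root.bam.

    Hypothetical helper for illustration; not part of genomecomb.
    """
    bam = Path(bamfile)
    name = bam.name
    if not (name.startswith("map-") and name.endswith(".bam")):
        raise ValueError(f"expected a file named map-$root.bam, got {name!r}")
    # $root is what remains after stripping the "map-" prefix and ".bam" suffix
    root = name[len("map-"):-len(".bam")]
    # Assumption: a specified preset changes the method part to isoquant_$preset
    # for all three files; the text only states this for isoform_counts.
    method = "isoquant" if preset is None else f"isoquant_{preset}"
    # Result files are placed in the same directory as the bam file
    return {
        kind: str(bam.parent / f"{kind}-{method}-{root}.tsv")
        for kind in ("isoform_counts", "gene_counts", "read_assignments")
    }
```

For example, default_result_files("map-sample1.bam", preset="sens") would give isoform_counts-isoquant_sens-sample1.tsv as the main result file.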

Arguments

bamfile
alignment on which to call isoforms
resultfile
resulting isoform count file; if given, it is used instead of the default name derived from the bam file name

Options

-preset preset
select one of a number of presets (default ont). Possible options are:
ont: optimal settings for ONT reads (this is the default)
sens: sensitive settings for ONT reads, more transcripts are reported, possibly at a cost of precision
all: reports almost all novel transcripts, loses precision in favor of recall (is also slower)
pacbio: optimal settings for PacBio CCS reads (default_pacbio in isoquant)
pacbiosens: sensitive settings for PacBio CCS reads, more transcripts are reported, possibly at a cost of precision (sensitive_pacbio in isoquant)
pacbioall: reports almost all novel transcripts (for pacbio), loses precision in favor of recall (is also slower)
assembly: optimal settings for a transcriptome assembly: input sequences are considered to be reliable and each transcript to be represented only once, so abundance is not considered
-distrreg
distribute regions for parallel processing (default s50000000). Possible options are:
0: no distribution (also empty)
1: default distribution
schr or schromosome: each chromosome processed separately
chr or chromosome: each chromosome processed separately, except the unsorted, etc. (with a _ in the name), which will be combined
a number: distribution into regions of this size
a number preceded by an s: distribution into regions targeting the given size, but breaks can only occur in unsequenced regions of the genome (N stretches)
a number preceded by an r: distribution into regions targeting the given size, but breaks can only occur in large (>=100000 bases) repeat regions
a number preceded by a g: distribution into regions targeting the given size, but breaks can only occur in large (>=200000 bases) regions without known genes
a file name: the regions in the file will be used for distribution
-refseq
-reftranscripts
-transcript_quantification
-gene_quantification
-data_type
-splice_correction_strategy
-model_construction_strategy
-matching_strategy
-threads
-skip
-regions
-skipregions
-cleanup
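As an illustration of how the different -distrreg values listed above can be told apart, here is a minimal Python sketch. It is not how genomecomb parses the option: the function name classify_distrreg and the returned labels are hypothetical, and treating any value that matches none of the other forms as a file name is an assumption.

```python
import re


def classify_distrreg(value):
    """Classify a -distrreg value into (kind, argument).

    Hypothetical helper for illustration; the labels are not genomecomb names.
    """
    value = str(value)
    if value in ("", "0"):
        return ("none", None)            # no distribution
    if value == "1":
        return ("default", None)         # default distribution
    if value in ("schr", "schromosome"):
        return ("schromosome", None)     # each chromosome separately
    if value in ("chr", "chromosome"):
        return ("chromosome", None)      # unsorted etc. (with _) combined
    match = re.fullmatch(r"([srg]?)(\d+)", value)
    if match:
        kinds = {
            "": "fixed_size",            # regions of exactly this size
            "s": "n_stretch_breaks",     # breaks only in N stretches
            "r": "repeat_breaks",        # breaks only in large repeat regions
            "g": "genefree_breaks",      # breaks only in large gene-free regions
        }
        return (kinds[match.group(1)], int(match.group(2)))
    # Assumption: anything else is taken to be a region file name
    return ("region_file", value)
```

With this sketch, the documented default s50000000 is classified as a target size of 50000000 bases with breaks restricted to N stretches.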

Results

read_assignments

The read_assignments file lists for each read where it aligns and which genes/isoforms it supports. For reads that could have come from multiple isoforms/genes, multiple lines are present (one for each isoform/gene). In this case the ambiguity and gambiguity fields indicate how many isoforms and genes are supported.

read_id
chromosome
begin
end
strand
exonStarts
exonEnds
aligned_size
isoform_id
gene_id
assignment_type
how does the read match the assigned transcript (unique, inconsistent, ..)
assignment_events
lists noted differences with the assigned transcript
inconsistency
level of inconsistency of the read with the assigned transcript based on assignment_events (reads with inconsistency >= 2 are not counted for an isoform):
0: unique or no inconsistencies of note
1: minor differences such as probable alignment artifacts, alternative transcription start / end
2: (incomplete) intron retentions
3: major inconsistencies (major_exon_elongation, alternative_structure, alt_donor_site, alt_acceptor_site, ...)
additional_info
ambiguity
number of isoforms the read supports (could come from), isoforms from different genes are included
gambiguity
number of genes the read supports (can be lower than ambiguity, because multiple isoforms from one gene are counted as one)
covered_pct
size of the (aligned) read vs the total size of the isoform (in percent)
polya
"True" if a polyA was detected in the read, "False" if no polyA was detected on the read (empty if read does not support any isoform)
classification
closest_known
cellbarcode
umi
umicount
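The following Python sketch shows one way to read a read_assignments file and apply the inconsistency rule described above (lines with inconsistency >= 2 are not counted for an isoform). It is an illustration only and does not reproduce how the isoform_counts file is computed: the function name supporting_reads_per_isoform is hypothetical, ambiguous reads are simply counted once per line, and it assumes a tab-separated file whose header line holds the field names listed above, with any leading lines starting with # being comments.

```python
import csv
from collections import Counter


def supporting_reads_per_isoform(path, max_inconsistency=1):
    """Count read_assignments lines per isoform_id, skipping inconsistent ones.

    Hypothetical helper for illustration; not the counting used by genomecomb.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        # Assumption: lines starting with '#' are comments before the header
        lines = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(lines, delimiter="\t"):
            isoform = row.get("isoform_id") or ""
            level = row.get("inconsistency") or ""
            # Skip lines without an assigned isoform or inconsistency level
            if not isoform or not level:
                continue
            # Lines with inconsistency >= 2 are not counted for the isoform
            if int(level) <= max_inconsistency:
                counts[isoform] += 1
    return counts
```

A read that supports several isoforms appears on several lines, so it contributes to each of them here; the ambiguity and gambiguity fields can be used to weight or filter such reads.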

Category

RNA