Var
Format
cg iso_isoquant ?options? bamfile ?resultfile?
Summary
Call and count isoforms and genes using isoquant.
Description
The command can be used to call and count isoforms and genes using
a method based on isoquant (https://github.com/ablab/IsoQuant).
The following adaptations are made when running isoquant within
genomecomb:
- Running is (or can be) heavily parallelised, in a way that also
allows distribution over a cluster
- Known transcripts and novel predictions/models are merged
into one result
- The main output is in the transcript format.
- Different types of counting are available from one run:
  - iq_all: isoquant standard output for the all option
  - weighed: reads supporting multiple (N) transcripts are weighed as 1/N
  - unique: count only reads uniquely supporting one transcript
  - strict: only unique reads that cover >= 90% of the transcript
  - aweighed, aunique, astrict: same as above, but reads must have
    a (detected) polyA
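The weighed, unique and strict counting types (and their polyA variants) can be illustrated with a minimal Python sketch. It is not the genomecomb implementation: it assumes each read/isoform assignment is available as a dict with the read_assignments fields isoform_id, ambiguity, covered_pct and polya, and it leaves out iq_all and any filtering on inconsistency. The function name and the example reads are made up for illustration.

```python
from collections import defaultdict

def count_isoforms(assignments):
    """Tally weighed/unique/strict counts (plus polyA-only variants) per isoform.

    `assignments` is an iterable of dicts with the keys isoform_id,
    ambiguity, covered_pct and polya (a hypothetical in-memory form of
    read_assignments rows, one row per read/isoform combination).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for row in assignments:
        iso = row["isoform_id"]
        n = int(row["ambiguity"])
        if not iso or n < 1:
            continue  # read does not support any isoform
        unique = n == 1
        strict = unique and float(row["covered_pct"]) >= 90
        polya = row["polya"] == "True"
        values = {
            "weighed": 1.0 / n,       # read shared by N isoforms counts 1/N
            "unique": float(unique),  # only reads supporting one isoform
            "strict": float(strict),  # unique and covering >= 90%
        }
        for name, value in values.items():
            counts[iso][name] += value
            if polya:
                counts[iso]["a" + name] += value  # aweighed, aunique, astrict
    return counts

# Invented example data: one read shared by tx1 and tx2, one unique read each.
reads = [
    {"isoform_id": "tx1", "ambiguity": 1, "covered_pct": 95.0, "polya": "True"},
    {"isoform_id": "tx1", "ambiguity": 2, "covered_pct": 60.0, "polya": "False"},
    {"isoform_id": "tx2", "ambiguity": 2, "covered_pct": 40.0, "polya": "False"},
    {"isoform_id": "tx2", "ambiguity": 1, "covered_pct": 50.0, "polya": "True"},
]
counts = count_isoforms(reads)
```

In this example both isoforms get a weighed count of 1.5 and a unique count of 1, but only tx1 gets a strict count, because the unique read on tx2 covers just 50% of the transcript.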
By default (if resultfile is not given) the names of the
resultfiles are derived from the bam file name. For a bamfile
map-$root.bam (where $root is the name of the sample and analysis),
the main result file will be in the same directory as the bam file and
named isoform_counts-isoquant-$root.tsv (or
isoform_counts-isoquant_$preset-$root.tsv if a preset was specified).
The gene_counts and read_assignments files are named similarly
(gene_counts-isoquant-$root.tsv and
read_assignments-isoquant-$root.tsv).
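The naming rule can be sketched in Python as follows. This is only an illustration of the pattern described above, not code from genomecomb: the function name is hypothetical, and applying the _$preset suffix to the gene_counts and read_assignments files as well is an assumption based on "named similarly".

```python
from pathlib import Path

def default_result_files(bamfile, preset=None):
    """Derive the default result file names for a map-$root.bam file.

    Assumption: the _$preset suffix is applied to all three files in
    the same way as for the isoform_counts file.
    """
    path = Path(bamfile)
    name = path.name
    if not (name.startswith("map-") and name.endswith(".bam")):
        raise ValueError(f"expected a file named map-$root.bam, got {name}")
    root = name[len("map-"):-len(".bam")]
    method = "isoquant" if preset is None else f"isoquant_{preset}"
    return {
        kind: path.parent / f"{kind}-{method}-{root}.tsv"
        for kind in ("isoform_counts", "gene_counts", "read_assignments")
    }

# Hypothetical bam file name, used for illustration only.
files = default_result_files("results/map-sminimap2-sample1.bam")
files_sens = default_result_files("results/map-sminimap2-sample1.bam", preset="sens")
```

Here files["isoform_counts"] ends in isoform_counts-isoquant-sminimap2-sample1.tsv, and files_sens["isoform_counts"] ends in isoform_counts-isoquant_sens-sminimap2-sample1.tsv, both in the results directory.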
Arguments
- bamfile
- alignment on which to call isoforms
- resultfile
- isoform count result file to write, instead of the default name
derived from the bam file name
Options
- -preset preset
- select one of a number of presets (default ont). Possible
options are:
  - ont: optimal settings for ONT reads (this is the default)
  - sens: sensitive settings for ONT reads; more transcripts are
    reported, possibly at a cost of precision
  - all: reports almost all novel transcripts; loses precision in
    favor of recall (is also slower)
  - pacbio: optimal settings for PacBio CCS reads (default_pacbio
    in isoquant)
  - pacbiosens: sensitive settings for PacBio CCS reads; more
    transcripts are reported, possibly at a cost of precision
    (sensitive_pacbio in isoquant)
  - pacbioall: reports almost all novel transcripts (for pacbio);
    loses precision in favor of recall (is also slower)
  - assembly: optimal settings for a transcriptome assembly: input
    sequences are considered to be reliable and each transcript to
    be represented only once, so abundance is not considered
- -distrreg
- distribute regions for parallel processing (default
s50000000). Possible options are:
  - 0: no distribution (an empty value has the same effect)
  - 1: default distribution
  - schr or schromosome: each chromosome processed separately
  - chr or chromosome: each chromosome processed separately, except
    the unsorted, etc. (with a _ in the name), which will be combined
  - a number: distribution into regions of this size
  - a number preceded by an s: distribution into regions targeting
    the given size, but breaks can only occur in unsequenced regions
    of the genome (N stretches)
  - a number preceded by an r: distribution into regions targeting
    the given size, but breaks can only occur in large (>=100000
    bases) repeat regions
  - a number preceded by a g: distribution into regions targeting
    the given size, but breaks can only occur in large (>=200000
    bases) regions without known genes
  - a file name: the regions in the file will be used for
    distribution
- -refseq
- -reftranscripts
- -transcript_quantification
- -gene_quantification
- -data_type
- -splice_correction_strategy
- -model_construction_strategy
- -matching_strategy
- -threads
- -skip
- -regions
- -skipregions
- -cleanup
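The different forms a -distrreg value can take may be easier to follow as a small classifier. The Python sketch below only mirrors the option text above; it is not how genomecomb parses the option. The function name and the mode labels are hypothetical, and mapping the value 1 to s50000000 is an assumption based on the documented default.

```python
import re

def classify_distrreg(value):
    """Return a (mode, argument) pair for a -distrreg value.

    Mode labels are made up for this sketch; the rules follow the
    option description.
    """
    value = str(value)
    if value in ("", "0"):
        return ("none", None)                  # no distribution
    if value == "1":
        return classify_distrreg("s50000000")  # assumed default
    if value in ("schr", "schromosome"):
        return ("schromosome", None)           # every chromosome separately
    if value in ("chr", "chromosome"):
        return ("chromosome", None)            # names with _ are combined
    match = re.fullmatch(r"([srg]?)(\d+)", value)
    if match:
        prefix, size = match.groups()
        modes = {
            "": "fixed_size",         # regions of exactly this size
            "s": "break_in_n",        # breaks only in N stretches
            "r": "break_in_repeats",  # breaks only in large repeats
            "g": "break_in_nogene",   # breaks only in large gene-free regions
        }
        return (modes[prefix], int(size))
    return ("file", value)                     # anything else: a region file

examples = {v: classify_distrreg(v) for v in ("0", "chr", "s50000000", "g5000000", "regions.tsv")}
```

The special values 0 and 1 are checked before the numeric pattern, because otherwise they would be read as region sizes.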
Results
read_assignments
The read_assignments file lists, for each read, where it aligns
and which genes/isoforms it supports. For reads that could have come
from multiple isoforms/genes, multiple lines are present (one for
each isoform/gene). In this case the ambiguity and gambiguity fields
indicate how many isoforms and genes are supported.
- read_id
- chromosome
- begin
- end
- strand
- exonStarts
- exonEnds
- aligned_size
- isoform_id
- gene_id
- assignment_type
- how does the read match the assigned transcript (unique,
inconsistent, ..)
- assignment_events
- lists the differences noted with the assigned transcript
- inconsistency
- level of inconsistency of the read with the assigned
transcript, based on assignment_events (reads with inconsistency
>= 2 are not counted for an isoform):
  - 0: unique or no inconsistencies of note
  - 1: minor differences such as probable alignment artifacts,
    alternative transcription start / end
  - 2: (incomplete) intron retentions
  - 3: major inconsistencies (major_exon_elongation,
    alternative_structure, alt_donor_site, alt_acceptor_site, ...)
- additional_info
- ambiguity
- number of isoforms the read supports (could come from),
isoforms from different genes are included
- gambiguity
- number of genes the read supports (<= ambiguity, because
multiple isoforms from one gene are counted as one)
- covered_pct
- size of the (aligned) read vs the total size of the isoform
(in percent)
- polya
- "True" if a polyA was detected in the read,
"False" if no polyA was detected on the read (empty if
read does not support any isoform)
- classification
- closest_known
- cellbarcode
- umi
- umicount
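As an illustration of how these fields fit together, the following Python sketch reads a read_assignments table and collects, per read, the isoforms it supports, skipping assignments with inconsistency >= 2. It assumes a tab-separated file with a header line; the function name is hypothetical, and the embedded example holds only a subset of the columns with invented values.

```python
import csv
import io
from collections import defaultdict

# Invented example rows; a real file contains all fields listed above.
EXAMPLE = (
    "read_id\tisoform_id\tgene_id\tinconsistency\tambiguity\tgambiguity\n"
    "r1\ttxA1\tgeneA\t0\t1\t1\n"
    "r2\ttxA1\tgeneA\t1\t2\t1\n"
    "r2\ttxA2\tgeneA\t0\t2\t1\n"
    "r3\ttxB1\tgeneB\t3\t1\t1\n"
)

def isoforms_per_read(handle, max_inconsistency=1):
    """Map each read_id to the isoforms it supports.

    Rows without an isoform, or with an inconsistency level above
    max_inconsistency, are skipped.
    """
    per_read = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row["isoform_id"]:
            continue
        if int(row["inconsistency"]) > max_inconsistency:
            continue
        per_read[row["read_id"]].append(row["isoform_id"])
    return dict(per_read)

supported = isoforms_per_read(io.StringIO(EXAMPLE))
```

In the example, read r2 has two lines (ambiguity 2) but both isoforms belong to geneA (gambiguity 1), and r3 is dropped because of its inconsistency level of 3. For a real file, pass an open file handle instead of the io.StringIO object.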
Category
RNA