makerefdb

Format

cg makerefdb ?options? dbdir

Summary

Create a reference sequence and annotation databases directory

Description

The reference directory contains the reference genome sequence with various indexes and the accompanying annotation databases. The cg makerefdb command creates a basic reference directory for a given genome based on data in the UCSC Genome Browser (https://genome.ucsc.edu/), except for the miRNA genes, which are obtained from miRBase.

The name of the directory indicates which genome data has to be downloaded: It should match the UCSC Genome Browser assembly ID of the desired genome.

By default, the reference sequence and some annotation databases are downloaded. You can use the options to adjust which annotations are added. For some genomes, genomecomb includes a script (e.g. makedbs/makedbs_hg38.sh for human) with extended preset options that adds (many) more annotations not coming from UCSC.

Some options give information that cannot easily be derived from the downloads but can be important for specific analyses. These are best supplied explicitly. For example, the -organelles option lists chromosomes that are actually organelle genomes (these are treated differently, e.g. by scywalker analysis).

When distributing over chromosomes (or regions), alt and unplaced chromosomes (usually small) are grouped together by default. This default relies on the fact that their names (often) contain a "_", so that e.g. chr1_KI270706v1_random and chr1_KI270707v1_random are grouped under chr1_. For some genomes this default is not ideal, e.g. for the dual human-mouse genome provided by cellranger it makes only 2 groups (one per genome).

You can use the -groupchromosomes option to specify a different way of grouping: each chromosome is matched against the given list of regular expressions, and if it matches one, the chromosome is assigned to a group named after the match. The resulting file genomeseq.groupchromosomes in the reference directory can be manually adjusted after creation (it is a tsv file assigning each chromosome to a group).
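The grouping rule described above can be illustrated with a small Python sketch. This is not genomecomb's implementation, only a model of the documented behaviour under stated assumptions: the default is modelled as the pattern `^[^_]*_` (everything up to and including the first "_"), patterns are tried in order and the first match wins, matching is a search rather than a full match, and unmatched chromosomes keep their own name as group. The function name `group_chromosomes` and the example patterns are hypothetical.

```python
import re

# Assumed model of the default: the prefix up to and including the first "_".
DEFAULT_PATTERNS = [r"^[^_]*_"]


def group_chromosomes(names, patterns=None):
    """Return a dict mapping each chromosome name to a group name.

    Illustrative sketch only (not genomecomb code). Each name is tested
    against the patterns in order; the group is the text matched by the
    first pattern that hits. Names that match nothing form their own group.
    """
    compiled = [re.compile(p) for p in (patterns or DEFAULT_PATTERNS)]
    groups = {}
    for name in names:
        group = name  # assumption: unmatched chromosomes are not grouped
        for regex in compiled:
            match = regex.search(name)
            if match:
                group = match.group(0)
                break
        groups[name] = group
    return groups


# Default behaviour on the names used in the text above.
default_groups = group_chromosomes(
    ["chr1", "chr1_KI270706v1_random", "chr1_KI270707v1_random"]
)

# Hypothetical patterns and chromosome names, for illustration only.
custom_groups = group_chromosomes(
    ["speciesA_chr1", "speciesA_chr2", "speciesB_chr1"],
    patterns=[r"^speciesA_chr1", r"^speciesB"],
)
```

With the default pattern, both chr1_*_random names end up in the group chr1_ while chr1 stays on its own. With the custom patterns, speciesA_chr2 matches neither pattern and therefore remains a separate group.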

Arguments

dbdir: name/path of result directory

Options

-genomeurl url: URL to download the genome sequence from (instead of the UCSC default)
-organelles list: list of chromosome names that are organelle genomes (these are treated differently e.g. by scywalker analysis)
-pseudoautosomal list: list of pseudoautosomal regions
-groupchromosomes list: list of patterns indicating which (small) chromosomes should be grouped for distrreg
-genesdb list
-mirbase string
-regionsdb_collapse
-regionsdb_join
-dbsnp
-refSeqFuncElemsurl
-transcriptsurl
-transcriptsgtf
-webcache

This command can be distributed on a cluster or over multiple cores using the job options (more info with cg help joboptions).

Example

The following example downloads the C. elegans ce11 reference sequence and some annotation databases. It distributes processing of the data over 4 cores (-d 4).


cg makerefdb -d 4 -v 2 \
    -regionsdb_collapse '
        simpleRepeat rmsk phastConsElements26way phyloP135way
    ' \
    -regionsdb_join '' \
    -genesdb '
        {refGene int reg}
        {ws245Genes extra int reg} 
        {ensGene extra int reg}
        {genscan extra}
        {augustusGene extra}
    ' \
    -mirbase cel-22.1:ce11 \
    /complgen/refseq/ce11
