GenomeComb

genome_seq

Format

cg genome_seq ?options? regionfile/regions dbdir ?outfile?

Summary

Returns the sequences of regions in the genome (as a fasta file), optionally masked for snps and repeats

Description

This command returns the sequences of the genomic regions listed in the file regionfile in fasta format (to stdout, or to the file outfile if given). Regionfile is a tab-delimited file with at least the following columns: chromosome begin end. By default, Repeatmasker repeats are softmasked (lowercase) in the output sequences. Optionally, you can hardmask repeats, and softmask or hardmask known (dbsnp) variants based on their frequency.
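
To make the difference between softmasking and hardmasking concrete, here is a minimal Python sketch. It is an illustration only, not GenomeComb's implementation: the function name mask_intervals and the 0-based, half-open (begin, end) interval convention are assumptions made for this example.

```python
def mask_intervals(seq, intervals, mode="s"):
    """Mask the given intervals of seq.

    intervals: iterable of (begin, end) pairs, assumed 0-based and half-open.
    mode "s": softmask (lowercase); mode "N": hardmask (replace by N).
    Hypothetical helper for illustration; not part of GenomeComb.
    """
    if mode not in ("s", "N"):
        raise ValueError("mode must be 's' or 'N'")
    chars = list(seq)
    for begin, end in intervals:
        # Clip the interval to the sequence bounds
        begin = max(begin, 0)
        end = min(end, len(chars))
        for i in range(begin, end):
            chars[i] = chars[i].lower() if mode == "s" else "N"
    return "".join(chars)


soft = mask_intervals("ACGTACGT", [(2, 4)])        # softmask bases 2 and 3
hard = mask_intervals("ACGTACGT", [(2, 4)], "N")   # hardmask the same bases
```

A softmasked base stays readable (downstream tools may choose to ignore lowercase), while a hardmasked base is lost, which is why the two kinds of masking have separate frequency thresholds in the options.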

Arguments

regionfile
tab-delimited file containing the target regions, with at least the following columns: chromosome begin end.
regions
If the string given for regionfile/regions does not exist as a file, it is parsed as a list of regions. Each region is given as chromosome, begin and end; these fields and the regions themselves can be separated in a variety of ways (colon, dash, comma, space or newline). For example, all of the following formats are accepted: 'chr1:100-200,chr2:100-200' 'chr1 100 200 chr2 100 200' 'chr1-100-200 chr2-100-200'
dbdir
directory containing reference genomes and variation data
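
The flexible region syntax described for the regions argument can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not GenomeComb's parser: parse_regions is a hypothetical name, and the sketch assumes chromosome names do not themselves contain any of the separator characters.

```python
import re


def parse_regions(text):
    """Parse a region string into (chromosome, begin, end) tuples.

    Fields may be separated by colons, dashes, commas or whitespace.
    Hypothetical helper for illustration; not part of GenomeComb.
    """
    # Split on any run of the accepted separator characters
    tokens = [t for t in re.split(r"[:\-,\s]+", text.strip()) if t]
    if len(tokens) % 3 != 0:
        raise ValueError("expected chromosome, begin, end for every region")
    return [
        (tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


regions = parse_regions("chr1:100-200,chr2:100-200")
```

With this approach the three example strings from the regions description all yield the same two regions.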

Options

-f freq (--freq)
only softmask (lowercase) dbsnp variants if they have a frequency > freq (given as a fraction, default is 0, use -1 to include all)
-fp freqp (--freqp)
only softmask (lowercase) dbsnp variants if they have a frequency > freqp (given as a percentage, default is 0, use -1 to include all)
-n freqn (--freqn)
only mask (using N) dbsnp variants if they have a frequency > freqn (given as a fraction, default is 0.2, use -1 to include all)
-np freqnp (--freqnp)
only mask (using N) dbsnp variants if they have a frequency > freqnp (given as a percentage, default is 20, use -1 to include all)
-p snpdbpattern (--snpdbpattern)
determines which variant databases are used (dbdir/var_*snpdbpattern*.tsv.gz). The default is "snp" for dbsnp. You can, for example, use "Common" for the common variants in dbsnp
-d delsize (--delsize)
only mask (using N) dbsnp variants if they are smaller than delsize (default is 5, use -1 to include all)
-r repeatmasker (--repeatmasker)
how to mask repeatmasker repeats: "s" means softmask (lowercase), use "N" to mask using Ns, and 0 for no repeatmasking (default is "s")
-i idcolumn (--id)
The ids for the fasta file will be taken from the given column (location will be added after a space)
-c concatseq (--concat)
using this option, all regions will be concatenated into one sequence with concatseq between them. To just concatenate the sequences, use -c ''
-m mapfile (--mapfile)
Create a map file that describes which regions in the newly created fasta file map to which regions in the genome
--namefield namefield
entries in the map file will have a name obtained from the namefield column in the region file
-cn concatname (--concatname)
The concatname will be the name of the sequence in the generated fasta file (if not given, the name will be based on the file)
-e concatend (--concatend)
The sequence given by concatend will be added to the start and end of the final sequence (only if the -c option was used)
-ca concatadj (--concatadj)
The concatseq (-c option) will only be added if regions are separated by at least one base. concatadj will be used to concatenate adjoining regions (and is '' by default)
-g windowsize (--gc)
add the gc content to the id line. If windowsize is 0, only the total gc content will be added. For windowsize > 0, the maximum gc content over windows of the given size will also be added (default is -1, for no gc content)
-gs gccontent (--gcsplit)
Split the result into low and high gc (high has gc >= gccontent). The gc used depends on the -g option. If -g is not given, the max gc at a windowsize of 100 is used. This option cannot be combined with concatenating sequences, and outfile has to be specified. Two files will be generated, with lowgc and highgc added to the given outfile name.
-gd 0/1 (--gcdisplay)
determines whether the gc content is actually displayed on the name line. By setting this to 0, you can set a windowsize (using -g) to split the files on, without the gc content being displayed on the name line. If you set -gd to 1 without setting -g, the total gc content will be shown
-s 0/1 (--split)
If this option is 1, each region will be saved as a separate fasta file. The
-l char (--limitchars)
Replace all but alphanumeric characters, _, . and - in the sequence names by char
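
The total and windowed gc content mentioned for the -g and -gs options can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, values are returned as fractions, and how GenomeComb itself treats Ns, short sequences or the reported units is not stated here, so those choices are assumptions of the sketch.

```python
def gc_fraction(seq):
    """Total gc content of seq as a fraction (masked lowercase bases count too)."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def max_window_gc(seq, windowsize):
    """Maximum gc fraction over all windows of windowsize bases.

    Assumption of this sketch: if the sequence is not longer than the
    window (or windowsize <= 0), the total gc fraction is returned.
    """
    s = seq.upper()
    if windowsize <= 0 or len(s) <= windowsize:
        return gc_fraction(s)
    # Count G/C in the first window, then slide one base at a time
    count = sum(1 for c in s[:windowsize] if c in "GC")
    best = count
    for i in range(windowsize, len(s)):
        count += (s[i] in "GC") - (s[i - windowsize] in "GC")
        best = max(best, count)
    return best / windowsize


total = gc_fraction("AAAAGGCC")
peak = max_window_gc("AAAAGGCC", 4)
```

A sequence can have a moderate total gc content while still containing a very gc-rich stretch, which is the case the windowed maximum is meant to catch when splitting into low and high gc files.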

Category

Validation