genome_seq
Format
cg genome_seq ?options? regionfile/regions dbdir ?outfile?
Summary
Returns sequences of regions in the genome (fasta file),
optionally masked for snps/repeats
Description
This command returns, in fasta format, the sequences of the genomic
regions given in the file regionfile (to stdout or to a file outfile).
Regionfile is a tab-delimited file with at least the following columns:
chromosome, begin, end. By default, Repeatmasker repeats are softmasked
(lower case) in the output sequences. Optionally you can hardmask
repeats, and soft- or hardmask known (dbsnp) variants based on frequency.
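The difference between softmasking and hardmasking described above can be illustrated with a minimal Python sketch. This is not the code used by cg genome_seq; the function name mask_intervals and the zero-based, half-open interval convention are assumptions made for the example only.

```python
def mask_intervals(seq, intervals, mode="s"):
    """Mask (begin, end) intervals in seq.

    Assumption for this sketch: intervals are zero-based and half-open.
    mode "s" lowercases the interval (softmask), "N" replaces it with Ns
    (hardmask), and "0" returns the sequence unchanged.
    """
    if mode == "0":
        return seq
    if mode not in ("s", "N"):
        raise ValueError("mode must be 's', 'N' or '0'")
    chars = list(seq)
    for begin, end in intervals:
        # Clip the interval to the sequence so out-of-range input is harmless.
        begin = max(begin, 0)
        end = min(end, len(chars))
        for i in range(begin, end):
            chars[i] = "N" if mode == "N" else chars[i].lower()
    return "".join(chars)


masked_soft = mask_intervals("ACGTACGTAC", [(2, 5)], "s")
masked_hard = mask_intervals("ACGTACGTAC", [(2, 5)], "N")
print(masked_soft)
print(masked_hard)
```

Softmasking keeps the bases readable while flagging them, whereas hardmasking discards the base information entirely.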
Arguments
- regionfile
- tab-delimited file containing targets with at least the following
columns: chromosome, begin, end.
- regions
- If the string given for regionfile/regions does not
exist as a file, it is parsed as a list of regions, given by
chromosome,begin,end that can be separated in a variety of ways
(colon, dash, comma, space or newlines), e.g., all of the following
formats are accepted: 'chr1:100-200,chr2:100-200' 'chr1 100 200
chr2 100 200' 'chr1-100-200 chr2-100-200'
- dbdir
- directory containing reference genomes and variation data
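How a region string with mixed separators might be split into regions can be sketched in Python as follows. This is only an illustration of the accepted formats listed above, not the parser cg genome_seq uses; the function name parse_regions is hypothetical, and the sketch assumes chromosome names do not themselves contain any of the separator characters.

```python
import re


def parse_regions(text):
    """Split a region string into (chromosome, begin, end) tuples.

    Any run of colons, dashes, commas or whitespace (including newlines)
    is treated as one separator; fields are then grouped in threes.
    """
    fields = [f for f in re.split(r"[:\-,\s]+", text) if f]
    if len(fields) % 3 != 0:
        raise ValueError("expected chromosome, begin, end for every region")
    regions = []
    for i in range(0, len(fields), 3):
        chrom, begin, end = fields[i:i + 3]
        regions.append((chrom, int(begin), int(end)))
    return regions


for example in (
    "chr1:100-200,chr2:100-200",
    "chr1 100 200\nchr2 100 200",
    "chr1-100-200 chr2-100-200",
):
    print(parse_regions(example))
```

All three example strings yield the same two regions.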
Options
- -f freq (--freq)
- only softmask (lowercase) dbsnp variants if they have a
frequency > freq (given as a fraction, default is 0, use -1 to
include all)
- -fp freqp (--freqp)
- only softmask (lowercase) dbsnp variants if they have a
frequency > freqp (given as a percentage, default is 0, use -1
to include all)
- -n freqn (--freqn)
- only mask (using N) dbsnp variants if they have a frequency
> freqn (given as a fraction, default is 0.2, use -1 to include
all)
- -np freqnp (--freqnp)
- only mask (using N) dbsnp variants if they have a frequency
> freqnp (given as a percentage, default is 20, use -1 to
include all)
- -p snpdbpattern (--snpdbpattern)
- determines which variant databases are used
(dbdir/var_*snpdbpattern*.tsv.gz). Default is "snp" for
dbsnp. You can, for example, use "Common" for the common variants
in dbsnp
- -d delsize (--delsize)
- only mask (using N) dbsnp variants if they are smaller than
delsize (default is 5, use -1 to include all)
- -r repeatmasker (--repeatmasker)
- how to mask repeatmasker repeats: "s" means
softmask (lowercase), use "N" to mask using Ns, and 0 for
no repeatmasking (default is "s")
- -i idcolumn (--id)
- The ids for the fasta file will be taken from the given
column (location will be added after a space)
- -c concatseq (--concat)
- using this option, all regions will be concatenated into one
sequence with concatseq between them. To just concatenate the
sequences, use -c ''
- -m mapfile (--mapfile)
- Create a map file that describes which regions in the newly
created fasta file map to which regions in the genome
- --namefield namefield
- entries in the map file will have a name obtained from the
namefield column in the region file
- -cn concatname (--concatname)
- The concatname will be the name of the sequence in the generated
fasta file (if not given, the name will be based on the file)
- -e concatend (--concatend)
- The sequence given by concatend will be added to the start and
end of the final sequence (only if the -c option was used)
- -ca concatadj (--concatadj)
- The concatseq (-c option) will only be added if regions are
separated by at least one base. concatadj will be used to concat
adjoining regions (and is '' by default)
- -g windowsize (--gc)
- Add gc content on the id line. If windowsize is 0, only the total
gc content will be added. For windowsize > 0, the max gc content
for the given windowsize will also be added (default = -1 for no gc
content)
- -gs gccontent (--gcsplit)
- Split the result in low and high gc (high has gc >=
gccontent). The gc used depends on the -g option. If -g is
not given, the maxgc at a windowsize of 100 is used. This option
cannot be combined with concatenating sequences, and outfile has to
be specified. Two files will be generated, with lowgc and highgc added
to the given outfile name.
- -gd 0/1 (--gcdisplay)
- determines if the gc content is actually displayed on the
name line. By setting this to 0, you can set a windowsize (using
-g) to split the files on, without the gc content being displayed
on the name line. If you set -gd to 1 without setting -g, the total
gc content will be shown
- -s 0/1 (--split)
- If this option is 1, each region will be saved as a separate
fasta file. The
- -l char (--limitchars)
- Replace all but alphanumeric characters, _, . and - in the
sequence names by char
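Two of the options above lend themselves to a short Python sketch: the total and maximum windowed gc content (-g) and the replacement of characters in sequence names (-l). The code below is an illustration under stated assumptions, not the implementation in cg genome_seq. The function names are hypothetical, gc is reported as a fraction between 0 and 1, softmasked (lowercase) bases are counted like uppercase ones, and "alphanumeric" is taken to mean ASCII letters and digits; the actual command may differ on each of these points.

```python
import re


def gc_fraction(seq):
    """Total gc content of seq as a fraction (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_window_gc(seq, windowsize):
    """Highest gc fraction over all windows of windowsize bases.

    Assumption: if the sequence is not longer than the window (or the
    window is not positive), the total gc content is returned instead.
    """
    seq = seq.upper()
    if windowsize <= 0 or len(seq) <= windowsize:
        return gc_fraction(seq)
    is_gc = [1 if base in "GC" else 0 for base in seq]
    current = sum(is_gc[:windowsize])
    best = current
    # Slide the window one base at a time, updating the count incrementally.
    for i in range(windowsize, len(is_gc)):
        current += is_gc[i] - is_gc[i - windowsize]
        best = max(best, current)
    return best / windowsize


def limit_chars(name, char):
    """Replace everything except ASCII alphanumerics, _, . and - by char."""
    return re.sub(r"[^A-Za-z0-9_.\-]", char, name)


print(gc_fraction("ATGCGCAT"))
print(max_window_gc("ATGCGCAT", 4))
print(limit_chars("chr1:100-200 geneA", "_"))
```

The windowed maximum can be much higher than the total gc content, which is why it is the value used for splitting into low and high gc when -g is not given.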
Category
Validation