bcol format

The bcol format stores one (or more) column(s) of data in an efficient binary format. All data must be of the same type (list of supported types below). bcol files can e.g. used to store a score for each position of the genome and use this for annotation.

The bcol format consists of two files, one in a text format describing the bcol data (bcolfile.bcol) and one containing the actual binary data (bcolfile.bcol.bin.zst). There may be a (third) index file (bcolfile.bcol.bin.zsti) for faster random access to the compressed binary data. bcol files can be created based on a tab-separated file (tsv) file using the command cg bcol make.

There was an older version of the bcol format (which did not support multiple chromosomes or multiple alleles) that is not described here. Files (there were separate bcol files per chromsome) in this older format can be converted to the current using

cg bcol_update newbcolfile oldbcolfile ?oldbcolfile2? ...

bcolfile.bcol

bcolfile.bcol is a tab-separated file (tsv) with the following format

# binary column
# type iu
# default 0
# precision 3
chromosome      begin   end
1       0       50
2       10      51

The fields "chromosome begin end" must be present, and indicate for which regions data is present in the binary file. They folow the same zero-based half-open indexing scheme as all genomecomb tsv files.

The comments before the table must contain the following information:

The first comment line "# binary column" indicates it is a bcol file.
The "# type" comment line indicates the type of data in the binary file (supported types listed below)
The "# default" indicates the default value: the return value for any position not in the file.
"# precision" indicates the default precision used when outputting values from the bcol.

A bcol file can contain multiple values for each position. This is indicated by the extra field multi in the bcol comments. e.g. "# multi A,C,T,G" is used in a var bcol to indicate that there are 4 values for each position for the bases A, C, T and G respectively.

Supported types

d: double-precision floating point (64 bit) in machine native byte order
q: double-precision floating point (64 bit) in little-endian byte order
Q: double-precision floating point (64 bit) in big-endian byte order
f: Single-precision floating point (32 bit) in machine native byte order
r: Single-precision floating point (32 bit) in little-endian byte order
R: Single-precision floating point (32 bit) in big-endian byte order
w: wide integer (64-bit) in little-endian byte order
W: wide integer (64-bit) in big-endian byte order
m: wide integer (64-bit) in machine native byte order
i: integer (32-bit) in little-endian byte order
I: integer (32-bit) in big-endian byte order
n: integer (32-bit) in machine native byte order
s: small integer (16 bit) in little-endian byte order
S: small integer (16 bit) in big-endian byte order
t: small integer (16 bit) in machine native byte order
c: 8 bit integer (1 byte or character) The integer types can have an u appended for unsigned types

bcolfile.bcol.bin.zst

bcolfile.bcol.bin.zst is the (compressed) binary file containing the actual data, a lz3 compressed continuous stream of values in the given binary type. The uncompressed binary file (bcolfile.bcol.bin) is also recognized, but by default the binaries are compressed. A zst index file bcolfile.bcol.bin.zst.zsti may also exists to speed up random access to the compressed binary.

Home

Contact

Installation

Documentation

bcol format

bcolfile.bcol

Supported types

bcolfile.bcol.bin.zst