GenomeComb

natural sort

Most software uses an ASCII or lexical sort, which compares strings character by character. This often produces an order that most people do not expect, e.g. chr11 sorted before chr2. A natural sort compares the numbers embedded in strings by their numeric value, giving an order that seems more natural to most people (e.g. chr1, chr2, chr11).
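
As a rough illustration of the difference, the Python sketch below compares a plain lexical sort with a simple natural sort key. The function name natural_key and its rules are assumptions made for this example only. It is a generic, simplified approach and not genomecomb's actual algorithm, which handles more cases.

```python
import re

# Matches runs of ASCII digits; the capture group keeps them in the split result.
_DIGITS = re.compile(r"([0-9]+)")

def natural_key(text):
    """Hypothetical simplified natural sort key (not genomecomb's implementation).

    Digit runs are compared as integers, everything else as plain strings.
    """
    key = []
    for i, chunk in enumerate(_DIGITS.split(text)):
        if chunk == "":
            continue
        if i % 2 == 1:
            # Odd positions are the digit runs matched by the pattern.
            key.append((0, int(chunk)))
        else:
            # Even positions are the text between digit runs.
            key.append((1, chunk))
    return key

names = ["chr11", "chr2", "chr1", "chr10"]

lexical = sorted(names)                   # ['chr1', 'chr10', 'chr11', 'chr2']
natural = sorted(names, key=natural_key)  # ['chr1', 'chr2', 'chr10', 'chr11']
```

The (0, ...) and (1, ...) tags only ensure that an integer is never compared directly with a string, which Python would reject.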

All sorting in genomecomb uses a natural sort that can sort strings, numbers, and combinations of these. This sorting is very important because most of the genomecomb tools process large files line by line, without loading them into memory. This is only possible if the files are properly sorted.
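
To show why line-by-line processing depends on the sort order, here is a minimal Python sketch of a streaming comparison between two sorted inputs. It keeps only one item from each input in memory at a time. The names natural_key and common_keys are hypothetical, the key function is a simplified stand-in, and none of this is taken from genomecomb's code.

```python
import re

def natural_key(text):
    # Hypothetical simplified key: digit runs compare as integers,
    # other text compares as plain strings.
    chunks = re.split(r"([0-9]+)", text)
    return [(0, int(c)) if i % 2 else (1, c)
            for i, c in enumerate(chunks) if c != ""]

def common_keys(left, right, key=natural_key):
    """Yield items found in both inputs, assuming both are sorted by `key`."""
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None and b is not None:
        ka, kb = key(a), key(b)
        if ka == kb:
            yield a
            a, b = next(left, None), next(right, None)
        elif ka < kb:
            # `a` cannot occur later in `right`, so it is safe to skip it.
            a = next(left, None)
        else:
            b = next(right, None)

shared = list(common_keys(
    ["chr1", "chr2", "chr10", "chr11"],
    ["chr2", "chr3", "chr11"],
))
# shared == ['chr2', 'chr11']
```

The step that skips the smaller item is only valid when both inputs follow the same order. If one input is sorted lexically instead (for example chr10 before chr2), matches can be missed silently.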

There is no standard, well-defined way to do a natural sort. While the chr2,chr11 case is obvious, there are quite a few less clear cases that can be handled in several ways. Files sorted using another natural sort algorithm may be ordered differently in these cases. How genomecomb handles natural sort is described below: