natural sort
Most software uses an ASCII or lexical sort. This often results in
an order that is unexpected by most people, e.g. chr11 sorted before
chr2. A natural sort interprets strings with numbers in them in a way
that seems more natural to most people (e.g. chr1,chr2,chr11)
All sorting in genomecomb uses a natural sort that can sort both
strings and numbers as well as combinations of these. This sorting is
very important because most of the genomecomb tools process large
files without loading them into memory by going over them line by
line. This is only possible if they are properly sorted.
There is no standard, well defined way to do a natural sort. While
the chr2,chr11 case is obvious, there are quite a few less clear
cases with multiple possible ways to handle them. Files sorted using
another natural sort algorithm might run into these. Below is
descibed how genomecomb handles natural sort:
  - strings are sorted alphabetically, based largely on ASCII order
  (Things like locale or unicode are not taken into account) with the
  following changes:
    - Numbers sort according to their numerical value, how
    exactly is explained further.
- asterisk (*) sorts after everything except control+space
    (this gets sam files to sort automatically as expected without
    further processing )
- The plus character (+) sorts right after the digits (and
    thus also after -)
- letters sort as A,a,B,... Case is not taken into account
    except as a secondary measure (if the strings are the same except
    for case) when sorting, e.g: A a MRNA mRNA mrna
- The characters that are between the upper and lower case
    letters in ASCII ([,\,],^,_,`) sort after the letters
 
  - Numbers are sorted according to their numerical value; they
  sort before strings, except strings starting with whitespace or
  empty strings.
  - A standalone (full/isolated) number starts at the start
  of the string or after whitespace. It follows a number pattern
  supporting signs, decimals and scientific notation (e.g.
  -10.10e-10). standolone numbers are sorted according to their
  numerical value, taking into account negative, decimal and
  scientific notation, e.g: -1e4 -100 -0.5e2 -2 -1 -0.2 -0.10 -0.1 0
  0.1 0.10 0.2 2 10 1e2
    - Even if isolated numbers have string characters AFTER them,
    they will be interpreted as full numbers, e.g: 0.1g 0.10g 0.2g
 
  - Embedded numbers are numbers that are not at the start
  of an element (e.g. chr1,chr2,chr10). Embedded numbers are also
  sorted according to numerical value, but only based on the integer
  part (i.e if present -,e,+ are considered part of the string in
  this case). e.g. in chr-1 we do not interpret the - as a minus,
  giving us the following sort: chr-1 chr-2 chr-10
  - A number may start with a + or a number of zeros; these are
  ignored unless the numerical value is the same between 2 elements.
  In that case the numbers with + in front will sort after the ones
  without, and the ones with fewer zeros are sorted after the ones
  with more zeros.
  - characters after a full number (not separated by whitespace)
  will be interpreted as a string. Numbers in these are treated as
  embedded numbers, (e.g.: -2 -1a -1a2 -1a10), even if it would be a
  number on its own (e.g.: -2 -1-1 -1-2 -1-10)