GenomeComb

Groupby

Format

cg groupby ?options? fields ?infile? ?outfile?

Summary

Group lines in a tsv file, based on 1 or more identical fields

Description

"cg select -g" provides most of the groupby functionality (and a lot extra). However, "cg groupby" can run using little memory on sorted files, whereas the results of "cg select" must always fit in memory.

"cg groupby" groups lines in a tab-separated file where the values in the given fields are the same. By default, entries in data fields (fields other than the ones on which the groups are made) are concatenated using commas.

Only consecutive lines with the same values in the given fields are grouped! Sort the file first if you are not sure that they are. If the expected result is small enough to fit in memory, you can use the "-sorted 0" option to avoid sorting first: it will group lines with the same values regardless of their order, keeping the results in memory and writing them when finished.
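The difference between grouping sorted and unsorted input can be illustrated with a small Python sketch. This is only a conceptual model of the behaviour described above, not GenomeComb's implementation: the function name `group_consecutive`, the sample rows, and the field names `chromosome` and `pos` are all hypothetical.

```python
from itertools import groupby

def group_consecutive(rows, key_fields):
    """Merge only adjacent rows whose values in key_fields are identical.

    Hypothetical helper: models default grouping of consecutive lines,
    with data fields joined by commas.
    """
    out = []
    key_of = lambda r: tuple(r[f] for f in key_fields)
    for key, members in groupby(rows, key=key_of):
        members = list(members)
        merged = dict(zip(key_fields, key))
        # Every field that is not a group field is treated as a data field.
        for field in members[0]:
            if field not in key_fields:
                merged[field] = ",".join(m[field] for m in members)
        out.append(merged)
    return out

# Made-up example rows; "chr1" appears again after "chr2".
rows = [
    {"chromosome": "chr1", "pos": "10"},
    {"chromosome": "chr1", "pos": "20"},
    {"chromosome": "chr2", "pos": "5"},
    {"chromosome": "chr1", "pos": "30"},
]

# Unsorted input: the trailing chr1 row ends up in a group of its own.
unsorted_result = group_consecutive(rows, ["chromosome"])

# Sorting on the group field first brings all chr1 rows together.
sorted_rows = sorted(rows, key=lambda r: r["chromosome"])
sorted_result = group_consecutive(sorted_rows, ["chromosome"])
```

In this sketch the unsorted input yields two separate chr1 groups, while the sorted input yields one. Because only the current group has to be held at any time, this style of grouping needs little memory, which is why sorting first is recommended.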

Arguments

fields
list of fields which should be the same for lines to be grouped
infile
file to be scanned; if not given, stdin is used. The file may be compressed.
outfile
write results to outfile; if not given, stdout is used

Options

-sumfields fields
compute the sum of all values in the given fields instead of concatenating them using commas. The values in the given fields should be numbers.
-f fields
return only the fields (separated by spaces) given by -f in the result (the group field is always returned as well)
-stats fields
create columns with statistics on the values in the given fields instead of concatenating them using commas. The values in the given fields should be numbers.
-sorted 0/1
if 0, lines with the same values do not have to be consecutive, but potentially large amounts of memory may be needed (default is 1)
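As a rough illustration of what "-sorted 0" and "-sumfields" imply, the Python sketch below accumulates every group in a dictionary before producing any output. It is a conceptual model under stated assumptions, not GenomeComb's code: the function name `group_unsorted`, the sample rows, the field names, and the output order (first appearance of each group) are hypothetical, and the real tool's output format may differ.

```python
def group_unsorted(rows, key_fields, sum_fields=()):
    """Group rows with identical key_fields values regardless of order.

    Hypothetical helper: all groups are held in memory until the input
    is exhausted, which is the memory cost the -sorted 0 option warns about.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        acc = groups.setdefault(key, {})
        for field, value in row.items():
            if field not in key_fields:
                acc.setdefault(field, []).append(value)

    out = []
    for key, acc in groups.items():
        merged = dict(zip(key_fields, key))
        for field, values in acc.items():
            if field in sum_fields:
                # Sum numeric values instead of concatenating them.
                merged[field] = sum(float(v) for v in values)
            else:
                merged[field] = ",".join(values)
        out.append(merged)
    return out

# Made-up example rows in unsorted order.
rows = [
    {"sample": "a", "count": "1", "tag": "x"},
    {"sample": "b", "count": "4", "tag": "y"},
    {"sample": "a", "count": "2", "tag": "z"},
]
result = group_unsorted(rows, ["sample"], sum_fields=["count"])
```

Here the two rows for sample "a" are merged even though they are not adjacent: the `count` values are summed and the `tag` values are joined with commas. The dictionary grows with the number of distinct groups, so this approach only suits results that fit in memory.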

Category

Query