GenomeComb

fastq_split

Format

cg fastq_split ?options? infile outtemplate

Summary

split a (large) fastq file in multiple smaller ones

Description

cg fastq_split will split a (large) fastq file in several smaller fastq files. You can either specify the target number of sequences per output file (-numseq option), or the target number of files you want to generate (-parts option). outtemplate is a filename (which may contain a path) that determines the names of the files the parts of the fastq file will be written to. The file part will be prepended with p<part>_., e.g. if outtemplate is results/test.fastq.gz, the results files will be named: results/p1_test.fastq.gz results/p2_test.fastq.gz ...

The command supports parallel processing using the joboptions, mainly for the second part of the command, namely compressing the resulting fastq files. The first part (splitting up the file) is not parallellized, but can be run as a single job on a cluster when using the -parts option. Using -numseq, the first part is always run directly (not as a job), because the command cannot know up front how many parts will be generated, and thus not submit the follow up compression jobs

Arguments

infile ...
fastq file to split
outtemplate ...
template for filenames that will be generated

Options

-parts number
number of output files to generate (this option has preference)
-numseq number
number of sequences to store per output file
-maxparts number
maximum number of parts that will be generated (if numseq is small, last part will contain all remaining sequenced)
-threads number
number of threads to use to compress the output files (bgzip -@)

Category

Conversion