GenomeComb
cg command ?joboptions? ...
Several commands understand these job options
Several genomecomb commands (indicated by "supports job options" in their help) consist of subtasks (jobs) that can be run in parallel. If such a command is interrupted, it can continue from the already processed data (completed subtasks) when the same command is run again. The --distribute or -d option can be used to select alternative processing methods.
Without a -d option (or with -d direct) the command runs its jobs sequentially, stopping when an error occurs in a job. It skips any job whose targets were already completed (and whose input files were not changed after this completion). While running, the command maintains a logfile containing one line of information per job (such as job name, status, starttime, endtime, duration, target files), which can be useful to check when an error occurs. After a successful run this logfile is removed, unless you explicitly specified -d direct or -d 1.
An analysis can be run faster by running multiple jobs at the same time, using multiple cores/threads on one machine (by giving the number of cores/threads that may be used to the -d option) or nodes on a cluster managed by Grid Engine (-d sge) or slurm (-d slurm).
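As an illustration, the same analysis can be distributed in different ways just by changing the -d option (the subcommand and sample directory below are hypothetical examples; any command that supports job options works the same way):

```shell
# run jobs sequentially (default, equivalent to -d direct)
cg process_sample sampledir
# run up to 8 jobs in parallel on the local machine
cg process_sample -d 8 sampledir
# submit all jobs to a Grid Engine or slurm cluster
cg process_sample -d sge sampledir
cg process_sample -d slurm sampledir
```
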
Some jobs depend on the results of previous jobs and have to wait for these to finish first, so the available capacity is not always fully used. If an error occurs in a job, the program will still continue with other jobs. Subsequent jobs depending on the results of the failed job will also fail, but others may run successfully. (In this way -d 1 is different from -d direct.)
The command used to start the analysis will produce a logfile (with the extension .submitting) while submitting the jobs. Each job gets a line in the logfile, and also produces a number of small files (runscript, joblogfile, error output) in directories called log_jobs. (The first column in the logfile refers to their location.) These small files are normally removed after a successful analysis (but not when there were errors, as they may be useful for debugging). When the command is finished submitting, the logfile is updated and gets the extension .running. When the run is completed, the logfile is again updated and gets the extension .finished if no jobs had errors, or .error if some did. You can update the logfile while jobs are running using the command:
cg job_update logfile
Using -d number starts up to number jobs in separate processes managed by the submitting command. The submitting command must keep running until all jobs are finished. It will output some progress information, but you can get more detail by updating and checking the logfile.
Using "-d sge" the submitting command will submit all jobs to the Grid Engine with the proper dependencies between the jobs. The program itself will stop when all jobs are submitted. You can check progress/errors by updating and checking the logfile.
Some tools request multiple cores/slots for one job on a Grid Engine cluster using the parallel environment (pe) named local (using -pe local <nr of slots> -R y). This will only work if the pe local is properly defined (it often is, by convention, but not by default). If no pe local is defined, multiple slots are not requested and nodes may become overloaded. If the local pe is defined, but not in a way that allows for this, jobs may not run. The following settings (create with "qconf -ap local" or modify with "qconf -mp local") work on our cluster:

    pe_name            local
    slots              9999999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  TRUE
    urgency_slots      min
    accounting_summary FALSE
Using "-d slurm" the submitting command will submit all jobs to slurm with the proper dependencies between the jobs. The program itself will stop when all jobs are submitted. You can check progress/errors by updating and checking the logfile.
List of all options supported by job-enabled commands:
When run on a cluster, it can be important to request enough memory/time. Genomecomb will by default request more memory or a larger timeslot for larger jobs. It is however very difficult to get the right amount for all occasions (as it can depend strongly on the data). Many genomecomb tools have the options -mem and -time to increase the requested amounts (versus the default). Not all do, however, so the following two generic job options can change these requested amounts for any job:
-dmem '* 2G map_bwa 10G' -dtime '* 2:00:00 map_bwa 10:00:00'
will set the default requested memory to 2G and the default requested time to 2 hours, but for mapping with bwa, 10G of memory and 10 hours will be requested.
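Put together, a hypothetical cluster run could combine these generic options with -d (process_sample and sampledir are illustrative names; the -dmem/-dtime values are the ones from the example above):

```shell
# submit to Grid Engine: 2G/2h by default, 10G/10h for the bwa mapping jobs
cg process_sample -d sge \
    -dmem '* 2G map_bwa 10G' \
    -dtime '* 2:00:00 map_bwa 10:00:00' \
    sampledir
```
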
Format Conversion