Diagnosing results/status of lots of LSF jobs
Over the past few months I’ve found myself running large numbers of jobs over an LSF system, for example assembling and annotating thousands of bacterial genomes or imputing thousands of human genomes in 5Mb chunks.
Inevitably some of these jobs fail, often for a number of different reasons. I thought it might be helpful to share some of the commands I’ve found useful for diagnosing the jobs that have finished. The commands apply to IBM Platform LSF (bsub), but I imagine they have slightly wider applicability.
bjobs -a -x
This command is useful if run just after jobs finish, so that they are still in the history (they are usually cleared after a couple of hours). It will show all jobs that have finished with a non-zero exit code, and also jobs which have underrun/overrun. This is especially useful if you’ve run something that has exited with an error early on, but still returns exit code 0 (e.g. wrong command line parameters).
find . -name "*.o" | xargs grep -L "Successfully completed"
Assuming all your job STDOUT files have the suffix .o (bsub -o), this will show any jobs (or at least their log files) that did not finish with exit code 0. It works because LSF writes "Successfully completed." into the report it appends to the output of a job that exits with code 0.
find - returns all file names which end with .o, searching recursively
xargs - passes these file names to grep as arguments, in batches
grep -L - returns the names of any files which do not contain the given phrase
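If any of the log paths contain spaces or quotes, the find | xargs pipeline can split them incorrectly. A minimal sketch of a more defensive variant is below; the function name unfinished_logs is purely illustrative, and it still assumes the logs end in .o and that a clean job report contains the phrase "Successfully completed".

```shell
# List LSF stdout logs that lack the "Successfully completed" marker.
# Usage: unfinished_logs [directory]   (directory defaults to .)
# The function name is hypothetical; the .o suffix is an assumption.
unfinished_logs() {
    # -exec ... {} + hands the file names to grep in batches, much like
    # xargs, but keeps names containing spaces or quotes intact.
    find "${1:-.}" -type f -name "*.o" -exec grep -L "Successfully completed" {} +
}
```

Running unfinished_logs . from the top of the results directory should list the same files as the one-liner above.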
find . -name "*.o" | xargs grep -l "MEMLIMIT"
Similar to the above command, except that it returns all the jobs that exceeded their memory limit. grep -l returns the names of files which do contain the match. This makes it easy to find jobs which just need to be resubmitted with a higher memory limit.
This and the above command can easily be extended by grepping for different strings in the log files.
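As a rough sketch of that kind of extension, the function below counts how many .o logs mention each of a couple of termination reasons. The name count_term_reasons is hypothetical, and the strings TERM_MEMLIMIT and TERM_RUNLIMIT are assumptions about what appears in your job reports, so check a failed log and edit the list to match.

```shell
# Count stdout logs that mention each termination reason.
# Usage: count_term_reasons [directory]   (directory defaults to .)
# The function name and the list of reason strings are assumptions.
count_term_reasons() {
    dir="${1:-.}"
    for reason in TERM_MEMLIMIT TERM_RUNLIMIT; do
        # grep -l prints one matching file name per line; wc -l counts them.
        # tr strips the padding that some wc implementations add.
        n=$(find "$dir" -type f -name "*.o" -exec grep -l "$reason" {} + | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$reason" "$n"
    done
}
```

The output is one tab-separated line per reason, which gives a quick overview of why a batch of jobs failed before deciding what to resubmit.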
find . -name "*.e" | xargs cat
Useful for some tasks: this displays everything written to STDERR, assuming you wrote it to files with the suffix .e (bsub -e). Some software writes its logs to STDERR, but in other cases you would expect this command to return no text, so any output points to a problem.
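With many jobs it can be hard to tell which file a given line of STDERR came from. One possible variant, sketched below, skips empty files and prints each file name before its contents; the function name show_stderr and the "==> name <==" header format are illustrative choices, not anything LSF provides.

```shell
# Print each non-empty stderr log under a directory, preceded by its name.
# Usage: show_stderr [directory]   (directory defaults to .)
# The function name and header format are hypothetical.
show_stderr() {
    # -size +0c skips empty files, so clean jobs produce no output.
    find "${1:-.}" -type f -name "*.e" -size +0c -exec sh -c '
        for f in "$@"; do
            printf "==> %s <==\n" "$f"
            cat "$f"
        done' sh {} +
}
```

If show_stderr . prints nothing, every .e file under the current directory is empty.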