Embarrassingly parallel job



Let's say you want to run the same command many times with different inputs, where each iteration is totally independent of the others. Or you could have many different scripts to run that are independent of each other. You have what we call an embarrassingly parallel problem. This page explains how to set this up on the PBS scheduler so that everything runs under one PBS job instead of lots of smaller jobs. Running several 1-CPU jobs on the HPC is not efficient: it asks more work of the scheduler, and it means other people in the same group can run fewer jobs (since there is a limit on the number of jobs in the queue per project). At the same time, bigger jobs tend to get higher priority than smaller jobs, so you may see a decrease in the time spent in the queue.

This type of scheduling work is often called an array job, and NCI already gives a lot of information on their help pages. Here we give a few more details.

Run the same script with a lot of inputs

Let's take an example. Say you have the following script, script.sh (make sure it is executable, e.g. with chmod +x script.sh):

#!/bin/bash
sleep 10
echo "I am ${HOSTNAME}, my arguments are >$*<"

And you want to run it with the inputs listed in the attached file File:inputs.txt, one set of arguments per line.
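The attached file is not reproduced here; as a purely hypothetical illustration, an inputs file for this script could look like the following, with each line holding one set of arguments:

```
1 first
2 second
3 third
```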

Scott Wales has written the following example to do this:

#!/bin/bash
# Run an embarrassingly parallel job, where each command is totally independent
# Uses gnu parallel as a task scheduler, then executes each task on the available cpus with pbsdsh

#PBS -q express
#PBS -l ncpus=32
#PBS -l walltime=0:10:00
#PBS -l mem=2gb
#PBS -l wd

module load parallel

SCRIPT=./script.sh  # Script to run.
INPUTS=inputs.txt   # Each line in this file is used as arguments to ${SCRIPT}
                    # It's fine to have more input lines than you have requested cpus,
                    # extra jobs will be executed as cpus become available

# Here '{%}' gets replaced with the job slot ({1..$PBS_NCPUS})
# and '{}' gets replaced with a line from ${INPUTS}
parallel -j ${PBS_NCPUS} pbsdsh -n {%} -- /bin/bash -l -c "${SCRIPT} {}" :::: ${INPUTS}

GNU parallel manages the parallelism while pbsdsh makes sure each command runs on the appropriate node. You want to launch bash as a login shell (-l) in this case because the environment created by pbsdsh is very basic.
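To see the scheduling pattern in isolation, here is a minimal local sketch of the same idea without PBS or GNU parallel: run one command per input line, with at most a fixed number of commands at a time. All names here (NCPUS, run_one, the demo file paths) are illustrative only, not part of the original setup.

```shell
#!/bin/bash
# Minimal sketch: one command per input line, at most NCPUS at once.
# Requires bash 4.3+ for 'wait -n'.

NCPUS=2
rm -f /tmp/demo_out.txt
printf '1 first\n2 second\n3 third\n' > /tmp/demo_inputs.txt

run_one() { echo "my arguments are >$*<" >> /tmp/demo_out.txt; }

while IFS= read -r line; do
    run_one $line &   # deliberately unquoted: each word becomes an argument
    # Throttle: if NCPUS jobs are already running, wait for one to finish
    while (( $(jobs -rp | wc -l) >= NCPUS )); do wait -n; done
done < /tmp/demo_inputs.txt
wait   # wait for the remaining jobs to finish
```

GNU parallel does the same throttling for you, and in the PBS job above pbsdsh additionally places each command on a CPU that belongs to the job.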

Run several scripts

The previous example assumed you have one script and a lot of inputs. If instead you have several different scripts, you can modify the command this way:

parallel -j ${PBS_NCPUS} pbsdsh -n {%} -- /bin/bash -l -c {} :::: ${INPUTS}

As before, bash is launched as a login shell because the environment created by pbsdsh is very basic. You then need to put each command line inside the inputs file as a separate line, as in the attached File:inputs2.txt. By command line we mean the script name plus all the arguments it needs (again, make sure the scripts are executable by you). Each line should look exactly as you would type it in a terminal when running that script on its own.
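The attached file is not reproduced here; as a hypothetical illustration (the script names and arguments below are made up), an inputs2.txt-style file holds one complete command per line:

```
./preprocess.sh --year 2000
./preprocess.sh --year 2001
./analyse.sh run42
```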

Outputs

If your scripts write to the standard output (for log information, for example), you need to make sure the outputs are handled correctly. If you'd like each of your scripts to create its own output file, the simplest approach is probably to use the second method (Run several scripts) and redirect the output of each command within the inputs file. See the attached example File:inputs3.txt.
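That attached file is not reproduced here either; as a hypothetical sketch (script names and log file names are made up), each line redirects its own standard output and standard error to a separate file:

```
./script.sh 1 first > output_1.log 2>&1
./script.sh 2 second > output_2.log 2>&1
./script.sh 3 third > output_3.log 2>&1
```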