Running the UM with Rose

Revision as of 23:57, 11 December 2019 by S.wales

Copying STASH fields

Two macros are provided with UM suites for copying STASH settings between jobs: STASHExport and STASHImport. You can find them in the Rose editor under Metadata -> um.


The exported STASH configuration will be saved into 'app/um/STASHexport.ini'. To import the settings into a new job, copy this file to the new suite's 'app/um/STASHImport.ini' and run the 'STASHImport' macro in the Rose editor on that suite.
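The copy step can be sketched as a small shell helper. This is only an illustration: the function name `copy_stash` is hypothetical, and it assumes both arguments are paths to suite working copies laid out as described above.

```shell
# Hypothetical helper: copy the exported STASH settings from one suite
# working copy into another, ready for the STASHImport macro.
copy_stash() {
    src_suite="$1"   # suite where STASHExport was run
    dst_suite="$2"   # suite that should receive the settings
    src_file="$src_suite/app/um/STASHexport.ini"
    dst_file="$dst_suite/app/um/STASHImport.ini"

    # Refuse to continue if the export macro has not been run yet
    if [ ! -f "$src_file" ]; then
        echo "No export found at $src_file; run STASHExport first" >&2
        return 1
    fi
    cp "$src_file" "$dst_file"
}

# Example call (both paths are placeholders):
# copy_stash /path/to/old-suite /path/to/new-suite
```

After copying, the STASHImport macro still has to be run from the Rose editor on the destination suite.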

Restarting and Extending Rose Suites

To restart a stopped job run

rose suite-run --restart

This will put the suite back into exactly the state it was in when Cylc stopped: failed tasks will still be failed, and if the suite had reached the end of the run Cylc will promptly stop again.

Resubmit a failed task by right-clicking on it and selecting 'Trigger Task'.

To extend the run dates, change the end time in the Rose editor and then reload the configuration.

If the configuration has changed (say you have edited the suite end date to make it run for another year), you need to reload it when you restart the suite, which you can do with:

rose suite-run --reload --restart

Resubmitting Tasks

If a UM task has failed (i.e. it has a red box in the Cylc GUI) you can resubmit it by right-clicking the task and selecting 'Trigger (run now)'.

The task will continue from the most recent restart file, provided that it is not the very first task. Resubmitting the first UM task will restart the run from the beginning.

Porting suites to NCI (work in progress)

Basics

Site-specific information goes into the `site` directory of the suite. If this already exists, follow the convention already in place; otherwise create a file `site/nci-raijin.rc` which contains at a minimum:

[ runtime ]
    [[ root ]]
        [[[ environment ]]]
            UMDIR = /projects/access/umdir
            TIDS = /g/data1/access/TIDS

    [[ ACCESSDEV ]]
        init-script = ""
        [[[ job submission ]]]
            method = background
        [[[ remote ]]]
            host = accessdev.nci.org.au

    [[ RAIJIN ]]
        init-script = """
            module purge
            export PATH=~access/bin:$PATH
            export ROSE_VERSION={{ ROSE_VERSION }}
            ulimit -s unlimited
            module load openmpi/1.10.2
            """
        [[[ remote ]]]
            host = raijin.nci.org.au
        [[[ job submission ]]]
            method = pbs
        [[[ directives ]]]
            -P = {{ NCI_PROJECT | default(environ['PROJECT']) }}
            -q = {{ NCI_QUEUE | default('normal') }}
            -l ncpus = 1
            -l mem = 1gb
            -l walltime = 0:10:00
            -l jobfs = 1gb
            -W umask = 0022

These sections provide default settings for jobs running on NCI servers: jobs on accessdev (e.g. code downloads) run in the background, and jobs on raijin are submitted to the PBS queue.

To link this into the main suite configuration add a line at the end of `suite.rc`:

{% include 'site/'+SITE+'.rc' %}

and in `rose-suite.conf` add a new Jinja setting (Jinja variables belong in the `[jinja2:suite.rc]` section):

SITE = 'nci-raijin'
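Before running the suite it can help to confirm that the file the include line will look for actually exists. The following is a minimal sketch, not part of Rose or Cylc: the function name `check_site` is hypothetical, and it assumes the include line is written exactly as shown above.

```shell
# Hypothetical pre-flight check: confirm the site file exists and that
# suite.rc contains the include line that pulls it in.
check_site() {
    suite_dir="$1"   # path to the suite working copy
    site="$2"        # value given to SITE, e.g. nci-raijin
    site_file="$suite_dir/site/$site.rc"

    if [ ! -f "$site_file" ]; then
        echo "Missing site file: $site_file" >&2
        return 1
    fi
    # Fixed-string search, so the Jinja braces are not treated as a regex
    if ! grep -qF "{% include 'site/'+SITE+'.rc' %}" "$suite_dir/suite.rc"; then
        echo "suite.rc does not include the site file" >&2
        return 1
    fi
    echo "Site configuration $site_file is wired in"
}

# Example call (the path is a placeholder):
# check_site /path/to/suite nci-raijin
```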

With this done Rose and Cylc will load the site configuration, but individual tasks still need to be hooked up. How to do this will depend on the suite layout. As an example, the Nested suite has two top-level groups, `HOST_LOCAL` and `HOST_HPC`, for tasks that should be run on the Cylc server and the HPC respectively, which don't inherit from anything else. In this case you can add to `site/nci-raijin.rc`:

[ runtime ]
    [[ HOST_LOCAL ]]
        inherit = ACCESSDEV
    [[ HOST_HPC ]]
        inherit = RAIJIN

Building the UM

A number of extra modules are required to build the UM. The best reference for the current recommendation is the rose-stem suite for the version you are running: https://code.metoffice.gov.uk/trac/um/browser/main/trunk/rose-stem/site/nci/family.rc.

At NCI we use a two-stage build: fcm extracts the code on accessdev, then copies it over to raijin, where it is built. The configuration might look like:

[ runtime ]
    [[ FCM_EXTRACT_RESOURCES ]]
        inherit = HOST_LOCAL

    [[ FCM_BUILD_RESOURCES ]]
        inherit = HOST_HPC
        init-script = """
            module purge
            export PATH=~access/bin:$PATH
            export ROSE_VERSION={{ ROSE_VERSION }}
            ulimit -s unlimited
            module load intel-fc/15.0.1.133
            module load intel-cc/15.0.1.133
            module load openmpi/1.10.2
            module load gcom/6.3_ompi.1.10.2
            module load netcdf/4.3.0
            module load grib-api/1.10.4
            module load drhook
            module load fcm
            module load shumlib/2017.06.1
            """

Resources