Revision as of 01:52, 7 June 2017 by Hwolff (talk | contribs) (Imported from Wikispaces)


This is an as-yet unlisted document to chart my progress with ACCESS-S. Once it's running, this will hopefully make it easier to convert into a proper document.

Getting ACCESS-S

This is the initial email I got from Hailin:

Hi Holger,

If you'd like to play the ACCESS-S1 suite, you can copy my suite au-aa563.

The following are the key parameters to run the suite.

In rose-suite.conf:
MAKE_BUILDS=true               #set true to compile source codes
N_GSHC_MEMBERS=3          #num of ensemble members, as for MEMBERS=-m in app/glosea_init_cntl_file/rose-app.conf
N_GSHC_STEPS=2                  #number of RESUBMIT (number of chunk runs)
RESUB_DAYS=1                     #number of days per chunk run

In app/glosea_init_cntl_file/rose-app.conf:
GS_HCST_START_DATE=1990050100   #start date, it is 01 of May in this case
MEMBERS=-m 3                   #total number of ensembles, must be the same as the N_GSHC_MEMBERS in rose-suite.conf
GS_YEAR_LIST=1997            #the year of the run

After you have compiled the code and run the job successfully, you can maintain your own INSTALL_DIR, which is defined in suite.rc:
INSTALL_DIR = "/short/dx2/hxy599/gc2-install"

If you have any problems please let me know.
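The email stresses that MEMBERS in app/glosea_init_cntl_file/rose-app.conf must match N_GSHC_MEMBERS in rose-suite.conf. A minimal consistency check, assuming the file layout of the copied suite (the path under ~/roses is a sketch):

```shell
# Compare N_GSHC_MEMBERS (rose-suite.conf) against MEMBERS=-m N
# (app/glosea_init_cntl_file/rose-app.conf). SUITE path is an assumption.
SUITE="${SUITE:-$HOME/roses/au-aa566}"
a=$(grep -o 'N_GSHC_MEMBERS=[0-9]*' "$SUITE/rose-suite.conf" 2>/dev/null | cut -d= -f2)
b=$(grep -o 'MEMBERS=-m [0-9]*' "$SUITE/app/glosea_init_cntl_file/rose-app.conf" 2>/dev/null | grep -o '[0-9]*$')
if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "ok: $a members in both files"
else
    echo "check members: N_GSHC_MEMBERS='$a' vs MEMBERS='-m $b'"
fi
```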


So I made a copy of that; the new suite is au-aa566. Most of the settings were already at the values Hailin listed in his email. I've changed INSTALL_DIR to /short/${PROJECT}/${USER}/gc2-install. However, I'm not a member of the groups dx2 or ub7, so I'm also trying to copy the DUMP_DIR and DUMP_DIR_BOM directories to /short/${PROJECT}/${USER}/dump and /short/${PROJECT}/${USER}/dump-bom, respectively -- but there is 27TB of data, and I can't do that.
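The INSTALL_DIR change can be sketched as below: derive the new path from $PROJECT and $USER and print the suite.rc line to use. No directories are created here; the layout just follows the paths mentioned above.

```shell
# Build the per-user INSTALL_DIR and echo the suite.rc setting.
# PROJECT defaults to w35 (the project shown in the PBS log later on).
PROJECT="${PROJECT:-w35}"
INSTALL_DIR="/short/${PROJECT}/${USER}/gc2-install"
echo "INSTALL_DIR = \"${INSTALL_DIR}\""
```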

Getting ACCESS-S to run

I've copied the job and just tried to run it, but it failed, with error messages culminating in: Illegal item: [scheduling]initial cycle time

The solution to this is to use older versions of Cylc and Rose with this command:

$ CYLC_VERSION=6.9.1 ROSE_VERSION=2016.06.1 rosie go
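The version pins matter for every later rose/cylc invocation, not only rosie go, so exporting them once per session is convenient. CYLC_VERSION and ROSE_VERSION are the same environment variables used in the command above:

```shell
# Pin the wrapper versions for the whole shell session instead of
# prefixing every command.
export CYLC_VERSION=6.9.1
export ROSE_VERSION=2016.06.1
# rosie go        # this and later "rose suite-run" calls pick up the pins
echo "pinned: cylc=$CYLC_VERSION rose=$ROSE_VERSION"
```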

First hurdles:

  1. gsfc_get_analysis gets a submit-failed
  2. GSHC_M1-3 get failed

For now, I've reset the suite.rc to point to the BoM directories, to see whether that changes anything -- it didn't.

Looking at the job activity log and the job script of gsfc_get_analysis, I notice strange PBS directives: resources = ConsumableMemory(2GB) and wall_clock_limit. I find these strings in suite.rc and replace them with -l vmem=2GB and -l walltime=01:11:00. (I also find another reference to these values for glosea_joi_prods, and change them as well.)
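Since the same LoadLeveler-style directives can appear for more than one task (a second copy turned up for glosea_joi_prods), it's worth listing every match before editing. A minimal sketch, run from the suite's working copy:

```shell
# List all Met Office scheduler directives that PBS won't understand.
# "|| true" keeps the exit status clean when nothing (or no file) is found.
grep -rn -E 'ConsumableMemory|wall_clock_limit' suite.rc app/ 2>/dev/null || true
```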

This seems to have succeeded for gsfc_get_analysis, but the GSHC_M1-3 tasks still fail. I found this error message:

???!!!???!!!???!!!???!!!???!!! ERROR        ???!!!???!!!???!!!???!!!???!!!???!!!
?  Error   Code:    19
?  Error   Message:  Error reading namelist NLSTCALL. Please check input list against code.
?  Error   from processor:     0
?  Error   number:     0

It seems the namelist passed in contains a value for control_resubmit, which the UM doesn't understand. Since rose considers this variable compulsory, I've had to remove it from the file ~/roses/au-aa566/app/coupled/rose-app.conf, and now I've submitted it again. (Or I could have disabled all metadata from the menu option...)
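The removal can be scripted; a hedged sketch, assuming rose-app.conf keeps one key=value per line (sed keeps a .bak backup, and the edit is a no-op if the file isn't present):

```shell
# Delete the control_resubmit entry so the UM never sees the unknown
# NLSTCALL item. The path is the one mentioned above.
APP_CONF="$HOME/roses/au-aa566/app/coupled/rose-app.conf"
if [ -f "$APP_CONF" ]; then
    sed -i.bak '/^control_resubmit=/d' "$APP_CONF"
fi
```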

Second issue:

gsfc_get_analysis fails at the end, but it seems that it's not doing all that much:

                  Resource Usage on 2017-06-06 15:39:23:
   Job Id:             5497709.r-man2
   Project:            w35
   Exit Status:        1
   Service Units:      1.24
   NCPUs Requested:    1                      NCPUs Used: 1
                                           CPU Time Used: 00:00:03
   Memory Requested:   500.0MB               Memory Used: 9.56MB
   Walltime requested: 01:11:00            Walltime Used: 01:14:15
   JobFS requested:    100.0MB                JobFS used: 0B

CPU time used is only 3 seconds, while it ran out of walltime after almost 1h15m.
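The footer can be checked mechanically: CPU time over walltime gives the utilisation, and a value near 0% suggests the task sat blocked (here, on paths that don't exist on raijin) until PBS enforced the walltime limit. A small sketch using the figures above:

```shell
# Convert HH:MM:SS to seconds, then compute percent CPU utilisation.
to_seconds() { echo "$1" | awk -F: '{print $1 * 3600 + $2 * 60 + $3}'; }
cpu=$(to_seconds "00:00:03")      # CPU Time Used
wall=$(to_seconds "01:14:15")     # Walltime Used
echo "utilisation: $(( 100 * cpu / wall ))%"   # ~0% => job was idle/blocked
```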

So it seems that, since SUITE_TYPE is set to research (and thereby GS_SUITE_TYPE is also research), some environment variables are set to directories that might exist on the Met Office computer, but not on raijin:

{%- if RUN_GSFC or RUN_GSMN %}
        environment scripting = """eval $(rose task-env)
                                   export SHORT_DATE=${ROSE_TASK_CYCLE_TIME%%00}"""
            ROSE_TASK_APP    = glosea_get_fcst_analyses
            {% if GS_SUITE_TYPE != 'research' %}
              FOAM_SUITE_NAME   = $(os_get_suiteid --mode={{ SUITE_TYPE }} ocean)
              GLOBAL_SUITE_NAME = $(os_get_suiteid --mode={{ SUITE_TYPE }} global)
            {% else %}
              ROSE_DATAC_GLOBAL = /critical/opfc/suites-oper/global/share/data/${ROSE_TASK_CYCLE_TIME}
              ROSE_DATAC_FOAM   = /critical/opfc/suites-oper/ocean/share/data/${ROSE_TASK_CYCLE_TIME}
            {% endif %}
            -l = "vmem=2GB,walltime=01:11:00"
#            resources        = ConsumableMemory(2Gb)
#            wall_clock_limit = "01:11:00,01:10:00"

For now I've replaced the else clause above with the same data as the original, and will try again.

Full Reset

Scott noticed that there were some new changes to the configuration file, namely RUN_GSFC and RUN_GSMN were set to true.

Since I couldn't remember ever changing them, I just did a full reset and changed only the project.