Restarting UM Runs
Long UM simulations that would normally run as a single job can instead be run as a sequence of shorter jobs that resubmit themselves. This is useful for large simulations whose wall time exceeds the limit of the PBS queue. The process, referred to as automatic resubmission, removes the need for the user to submit each job manually.
Overview of the process
Resubmitted runs are run in sections. Each section uses the same settings, so there is no need to change the start date or run length between sections. Each section writes a dump file when it finishes, and the next section restarts the model from that last dump file. You won't need to update ancillary files between sections either. The process involves four basic steps:
- Set up the total run length then enable and configure some settings in the resubmission window.
- Submit the first section, called the Initial or New run (NRUN).
- Include a hand-edit file that will specify that subsequent runs will be Continuation runs (CRUN).
- Submit the job again. This will commence the automatic resubmission process and jobs will be resubmitted until the Target run length is reached.
Total run length
The start date and total run length of the entire simulation should be entered in Input/Output Control -> Start Date and Run Length.
In this example the simulation will cover 1 month starting on 1/10/1978.
Section run length
The run length for each section is entered in Input/Output Control -> Job submission. Once on the job submission screen, press NEXT to reach the re-submission panel.
You'll need to enable the automatic re-submission checkbox, then specify how long each section of the job should run for. You may also need to change the job time limit.
In this example each section will run for 5 days, so there will be 6 sections to make up a 1 month simulation.
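The section count in this example can be checked with a short calculation. This is only an illustrative sketch: the function name is hypothetical, and it assumes the 1 month run is 30 days long (which is what 6 sections of 5 days implies). The UMUI works this out for itself.

```python
def count_sections(total_days: int, section_days: int) -> int:
    """Return the number of sections needed to cover total_days.

    The last section is counted even if it is shorter than section_days.
    """
    if total_days <= 0 or section_days <= 0:
        raise ValueError("run lengths must be positive")
    # Integer ceiling division, avoiding floating point rounding.
    return (total_days + section_days - 1) // section_days


# Assumption: a 1 month run of 30 days, split into 5 day sections.
print(count_sections(30, 5))
```

If the total run length is not an exact multiple of the section length, the sketch counts one extra, shorter section.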
You'll also need to make sure a model dump is written at the end of each section for the model to be restarted from. In Atmosphere->Control->Post processing->Dumping and Meaning, check how often restart dumps are written. The section length should be a multiple of the regular dump frequency to ensure that a restart dump is written on the last time-step of each section.
In this example a restart dump is written every 5 days, at the end of each run section.
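The requirement that the section length be a multiple of the dump frequency can be expressed as a simple check. This is a minimal sketch with a hypothetical function name, and both lengths are assumed to be given in whole days. It is not part of the UM or the UMUI.

```python
def dump_written_at_section_end(section_days: int, dump_every_days: int) -> bool:
    """Return True if a restart dump falls on the last day of each section.

    This holds when the section length is a whole multiple of the
    regular dump frequency.
    """
    if section_days <= 0 or dump_every_days <= 0:
        raise ValueError("lengths must be positive")
    return section_days % dump_every_days == 0


# The example settings: 5 day sections, a dump every 5 days.
print(dump_written_at_section_end(5, 5))
# A mismatched setting: 5 day sections, a dump every 3 days.
print(dump_written_at_section_end(5, 3))
```

A failed check means that a section would end without a fresh restart dump, so the next section would have nothing up to date to restart from.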
The initial run will run the reconfiguration on the ancillary files and initial conditions, run the model for a single section and output a restart file from which the job can be continued. For this run the hand-edit file ~access/crun.ed must be disabled.
Once the initial run has completed, further runs can be continued from the restart files. The hand-edit file ~access/crun.ed makes the job read from restart files instead of reconfiguring the initial conditions. The model will run for a single section, output the next restart dump, and then resubmit itself to the queue to run the next section. This will continue until the model has run for the total length of the simulation.
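The resulting sequence of jobs can be pictured with the following sketch. It only models the bookkeeping, it does not submit anything, and the function name and the 30 day total are assumptions made for illustration. The first section is the NRUN, and every later section is a CRUN that starts where the previous one stopped.

```python
def plan_run(total_days: int, section_days: int) -> list[tuple[str, int, int]]:
    """Return (run_type, start_day, end_day) for each section of the run."""
    if total_days <= 0 or section_days <= 0:
        raise ValueError("run lengths must be positive")
    plan = []
    done = 0  # days of simulation completed so far
    while done < total_days:
        # Only the very first section reconfigures; the rest restart from a dump.
        run_type = "NRUN" if done == 0 else "CRUN"
        end = min(done + section_days, total_days)
        plan.append((run_type, done, end))
        done = end
    return plan


# Assumption: a 30 day total run in 5 day sections.
for run_type, start, end in plan_run(30, 5):
    print(f"{run_type}: day {start} -> day {end}")
```

In the real workflow the first CRUN is submitted by hand after enabling the hand-edit file, and only the later sections are resubmitted automatically.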
Extending the run length
You can extend the total run duration after the run has started by setting a longer total run length in the UMUI. Either wait for the current section to complete or stop the run manually using 'qdel'. Then increase the run length setting in the UMUI, and process and submit the job as a CRUN. Be sure to check that any ancillary files you are using will be valid for the new run duration.
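As a rough guide to how many extra sections an extension will add, the calculation can be sketched as follows. The function name and the example numbers are hypothetical, and the sketch assumes that the run stopped cleanly at the end of a section.

```python
def remaining_sections(completed_days: int, new_total_days: int, section_days: int) -> int:
    """Return how many CRUN sections are still needed after extending the run."""
    if section_days <= 0:
        raise ValueError("section length must be positive")
    if new_total_days < completed_days:
        raise ValueError("new total run length is shorter than what has already run")
    remaining = new_total_days - completed_days
    # Integer ceiling division: a final partial section still counts.
    return (remaining + section_days - 1) // section_days


# Assumption: 30 days already completed, run extended to 60 days, 5 day sections.
print(remaining_sections(30, 60, 5))
```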