Recovering from an interrupted job

Revision as of 01:52, 18 March 2019

Recovering from failed jobs

Simulations sometimes fail, for example because the job runs out of its allocated resources or because a compute node fails. You can recover from such a failure by restarting from an output dump, if one is available.

Restarting from an intermediate dump works the same way as a continuation run (CRUN). Add the hand-edit file ~access/crun.ed to the job if it is not already present, then resubmit the job. It will restart from the latest dump (*.da) file. There is no need to alter the run start time or run duration; the run simply continues from where it left off. You also don't have to enable automatic re-submission, as restarting works without it.
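Before resubmitting, it can help to check which dump the run is likely to pick up. The following is a minimal sketch, not part of the model tooling: the function name latest_dump is hypothetical, the directory to search is supplied by you, and it assumes that "latest" can be approximated by file modification time. The default pattern "*.da*" is also an assumption, chosen to match names ending in .da as well as names with a date stamp after .da.

```python
from pathlib import Path
from typing import Optional


def latest_dump(run_dir: str, pattern: str = "*.da*") -> Optional[Path]:
    """Return the most recently modified dump file in run_dir, or None.

    Assumption: dump files can be recognised by the glob `pattern`, and
    modification time is a reasonable proxy for "latest dump".
    """
    # Collect regular files only, ignoring any directories that match.
    dumps = [p for p in Path(run_dir).glob(pattern) if p.is_file()]
    # max() with default=None handles a directory with no dumps in it.
    return max(dumps, key=lambda p: p.stat().st_mtime, default=None)
```

Calling latest_dump("/path/to/dumps") returns a Path to the newest matching file, or None if no dump is found, in which case there is nothing to restart from.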


Continuing from a different dump

To continue the run from a dump that isn't the most recent, edit the file $RUNID.phist (found in the model run directory) and change the value of ARESTART to point to the dump you'd like to restart from. Once this is done, submit a CRUN following the instructions above.
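The edit to $RUNID.phist can also be scripted. The sketch below is illustrative only and rests on an assumption about the file layout: that the entry appears as a quoted assignment such as ARESTART='/path/to/dump'. Check your own .phist file before relying on it. The function names set_arestart and update_phist are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumption: the entry looks like  ARESTART='...'  or  ARESTART="...",
# with optional spaces around the equals sign.
_ARESTART = re.compile(r"""(\bARESTART\s*=\s*)(['"])[^'"]*\2""")


def set_arestart(phist_text: str, dump_path: str) -> str:
    """Return phist_text with the first ARESTART value set to dump_path."""
    new_text, count = _ARESTART.subn(
        # Keep the original "ARESTART =" prefix and quote style.
        lambda m: f"{m.group(1)}{m.group(2)}{dump_path}{m.group(2)}",
        phist_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no ARESTART entry found")
    return new_text


def update_phist(phist_file: str, dump_path: str) -> None:
    """Rewrite phist_file in place, keeping a .bak copy of the original."""
    path = Path(phist_file)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(set_arestart(path.read_text(), dump_path))
```

Keeping a .bak copy means the original history file can be restored if the CRUN does not start from the intended dump.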