Difference between revisions of "Recovering from an interrupted job"

(Imported from Wikispaces)
[[Category: Unified Model]]

Latest revision as of 01:52, 18 March 2019

Recovering from failed jobs

Simulations sometimes fail, for example by running out of allocated resources or because a compute node fails. It is possible to recover from such failures by restarting from an output dump, if one is available.

Restarting from an intermediate dump works the same way as a continuation run (CRUN). Add the hand-edit file ~access/crun.ed to the job if it is not already present, then resubmit the job. It will restart from the latest dump (*.da) file. There is no need to alter the run start time or run duration; the run simply continues from where it left off. You also don't have to enable automatic resubmission, as restarting works without it.
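Before resubmitting, it can be useful to check which dump the run directory actually contains. The following is a minimal sketch, not part of the UM tooling: the helper name latest_dump is hypothetical, the default pattern "*.da*" is an assumption about how dump files are named in your run directory, and "latest" is taken here to mean the most recently modified file.

```python
from pathlib import Path


def latest_dump(run_dir, pattern="*.da*"):
    """Return the most recently modified dump file in run_dir, or None.

    The glob pattern is an assumption; adjust it to match the dump
    file names your job actually produces.
    """
    dumps = [p for p in Path(run_dir).glob(pattern) if p.is_file()]
    if not dumps:
        return None
    # Pick the file with the newest modification time.
    return max(dumps, key=lambda p: p.stat().st_mtime)
```

Calling latest_dump on the model run directory returns a Path, or None if no file matches, in which case there is no dump to restart from.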


Continuing from a different dump

To continue the run from a dump that isn't the most recent, edit the file $RUNID.phist (found in the model run directory) and change the value of ARESTART to point to the dump you'd like to restart from. Once this is done, submit a CRUN following the instructions above.
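The ARESTART edit can also be scripted. This is a rough sketch under a stated assumption: that the .phist file holds a single quoted entry of the form ARESTART='...' on its own line. Check your own file before relying on it. The function names set_arestart and update_phist are hypothetical, and a backup copy is written first so the original history file can be restored.

```python
import re
import shutil
from pathlib import Path

# Assumed entry format: ARESTART = '<path>' (single or double quotes).
_ARESTART = re.compile(
    r"^(\s*ARESTART\s*=\s*)(['\"])(.*?)\2", re.IGNORECASE | re.MULTILINE
)


def set_arestart(text, dump_path):
    """Return text with the ARESTART value replaced by dump_path."""
    def _swap(match):
        quote = match.group(2)
        return f"{match.group(1)}{quote}{dump_path}{quote}"

    new_text, count = _ARESTART.subn(_swap, text)
    if count != 1:
        # Refuse to guess if the entry is missing or ambiguous.
        raise ValueError(f"expected exactly one ARESTART entry, found {count}")
    return new_text


def update_phist(phist_file, dump_path):
    """Back up phist_file to <name>.bak, then rewrite its ARESTART entry."""
    phist_file = Path(phist_file)
    shutil.copy2(phist_file, phist_file.with_name(phist_file.name + ".bak"))
    phist_file.write_text(set_arestart(phist_file.read_text(), dump_path))
```

If the file does not contain exactly one matching entry, set_arestart raises ValueError rather than writing anything, which is the point at which to fall back to editing the file by hand.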