Recovering from failed jobs

Simulations sometimes fail, for example by running out of allocated resources or because a compute node fails. You can recover from such failures by restarting from an output dump, if one is available.

Restarting from an intermediate dump works the same way as a continuation run (CRUN). Add the hand-edit file ~access/crun.ed to the job if it is not already present, then resubmit the job. It will restart from the latest dump (*.da) file. There is no need to alter the run start time or run duration; the run simply continues from where it left off. You also do not need to enable automatic re-submission; restarting works without it.
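Before resubmitting, it can be useful to confirm that a dump actually exists in the run directory. The following is a minimal sketch of such a check, not part of the model tooling: the helper name latest_dump, the glob pattern "*.da*", and the use of file modification time to decide which dump is newest are all assumptions made for illustration. The model itself decides which dump to restart from.

```python
from pathlib import Path
from typing import Optional


def latest_dump(run_dir: str, pattern: str = "*.da*") -> Optional[Path]:
    """Return the most recently modified dump file in run_dir, or None.

    Hypothetical helper: the pattern and the "newest by modification
    time" rule are assumptions, not how the model chooses its restart dump.
    """
    candidates = [p for p in Path(run_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the largest modification timestamp.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example use (the directory path is a placeholder):
# dump = latest_dump("/path/to/model/run/directory")
# print(dump if dump else "No dump found; a restart is not possible.")
```

If the function returns None, there is no dump to restart from and resubmitting as a CRUN would not help.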

Continuing from a different dump

To continue the run from a dump that isn't the most recent, edit the file $RUNID.phist (found in the model run directory) and change the value of ARESTART to point to the dump you'd like to restart from. Once this is done, submit a CRUN following the instructions above.
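The edit to ARESTART can also be scripted. The sketch below assumes the entry appears on its own line in the form ARESTART = '<path>' with a quoted value; check your own $RUNID.phist before relying on this, since the exact layout is an assumption here. The function name set_arestart is hypothetical, and it only transforms text, so you can inspect the result and keep a backup before overwriting the file.

```python
import re

# Assumed layout: a line such as  ARESTART = '/path/to/dump'  (quoted value).
_ARESTART = re.compile(r"^([ \t]*ARESTART[ \t]*=[ \t]*)(['\"])(.*?)\2", re.MULTILINE)


def set_arestart(phist_text: str, new_dump: str) -> str:
    """Return phist_text with the ARESTART value replaced by new_dump.

    Hypothetical helper: raises ValueError unless exactly one ARESTART
    entry in the assumed format is found, so nothing is changed silently.
    """
    def _swap(match: "re.Match[str]") -> str:
        prefix, quote = match.group(1), match.group(2)
        return f"{prefix}{quote}{new_dump}{quote}"

    updated, count = _ARESTART.subn(_swap, phist_text)
    if count != 1:
        raise ValueError(f"expected exactly one ARESTART entry, found {count}")
    return updated


# Example use (file name and dump path are placeholders):
# from pathlib import Path
# phist = Path("RUNID.phist")
# original = phist.read_text()
# Path("RUNID.phist.bak").write_text(original)   # keep a backup first
# phist.write_text(set_arestart(original, "/path/to/chosen/dump"))
```

Keeping the transformation separate from the file write makes it easy to review the change before submitting the CRUN.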
