Latest revision as of 01:52, 18 March 2019

Recovering from failed jobs

Simulations sometimes fail, for example when a job runs out of allocated resources or when a compute node fails. You can recover from such a failure by restarting from an output dump, if one is available.

Restarting from an intermediate dump works the same way as a continuation run (CRUN). Add the hand-edit file ~access/crun.ed to the job if it is not already present, then resubmit the job. The run will restart from the latest dump (*.da) file. There is no need to alter the run start time or run duration; the run simply continues from where it left off. You also don't have to enable automatic re-submission, because restarting works without it.
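Before resubmitting, it can help to check which dump files are actually present in the run directory. The following is a minimal sketch, not part of the model tooling: the function name latest_dump is hypothetical, the "*.da*" pattern assumes dump names contain ".da" possibly followed by a date suffix, and "latest" is taken to mean the most recent modification time, which may not be how the model itself selects the dump.

```python
from pathlib import Path
from typing import Optional


def latest_dump(run_dir: str, pattern: str = "*.da*") -> Optional[Path]:
    """Return the most recently modified dump file in run_dir, or None.

    Assumptions (not taken from the model documentation):
    - dump files match `pattern` directly inside `run_dir`
    - the newest file by modification time is the latest dump
    """
    dumps = [p for p in Path(run_dir).glob(pattern) if p.is_file()]
    if not dumps:
        return None
    return max(dumps, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Check the current directory; replace "." with the model run directory.
    found = latest_dump(".")
    print(found if found is not None else "No dump files found")
```

If the function returns None, no dump is available and the run cannot be restarted this way.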

Continuing from a different dump

To continue the run from a dump that isn't the most recent, edit the file $RUNID.phist (found in the model run directory) and change the value of ARESTART to point to the dump you'd like to restart from. Once this is done, submit a CRUN following the instructions above.
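The edit to ARESTART can be made by hand in any text editor. As an illustration only, the sketch below does the same thing programmatically. It assumes the entry appears once in the file as a namelist-style line of the form ARESTART = 'path', with a quoted value; check the real file before relying on this. The names set_arestart and update_phist are hypothetical helpers, not part of any model tooling.

```python
import re
import shutil
from pathlib import Path

# Assumed layout: a single line such as   ARESTART = '/path/to/dump',
_ARESTART = re.compile(r"^([ \t]*ARESTART[ \t]*=[ \t]*)(['\"])(.*?)\2", re.MULTILINE)


def set_arestart(text: str, dump_path: str) -> str:
    """Return `text` with the quoted ARESTART value replaced by `dump_path`."""
    new_text, count = _ARESTART.subn(
        # A function replacement avoids escape issues if the path has backslashes.
        lambda m: f"{m.group(1)}{m.group(2)}{dump_path}{m.group(2)}",
        text,
    )
    if count != 1:
        raise ValueError(f"expected exactly one ARESTART entry, found {count}")
    return new_text


def update_phist(phist_file: str, dump_path: str) -> None:
    """Back up the history file, then point ARESTART at `dump_path`."""
    path = Path(phist_file)
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # keep the original
    path.write_text(set_arestart(path.read_text(), dump_path))
```

The backup copy (with a .bak suffix) makes it easy to revert if the run does not restart as expected.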

Revision as of 01:50, 21 April 2015 by ScottWales (imported from Wikispaces); latest revision as of 01:52, 18 March 2019 by S.wales.

Difference between revisions of "Recovering from an interrupted job"
