Storage-workflow

Template for storage training workflow

Reminder: the idea is to divide the training into different sessions, each session covering a part of a typical workflow.

Each session should start by presenting one or more issues we encounter often, with their typical causes and consequences. After this we should cover what the right approach is and why. Finally, we run through some practical examples using the commands listed in the skill sets. Ask participants why they think something is an issue, or what they think caused it. Video on bad filesystem use?

Session 1: Running a model and /short

Intro to entire training

Different filesystems (prerequisite)

Problems:

/short is full or nearly full and I don't have space to store my output
my simulation stopped because it produces more output than can be stored on /short
I have exceeded the inode quota

Causes:

someone is using /short to run the analysis on their model output
someone ran a huge simulation and didn't set it up so that the output is transferred to gdata regularly
someone just left their data there forever
you didn't estimate your model output correctly, or at all
logs and compiled files are small, but large numbers of files degrade filesystem performance

We could use a snapshot showing how nearly every week there's a warning that a project is over 90% or full, and explain what happens when the quota is exceeded. Explain how /short is meant to work, why we can't just increase it, and how to use it correctly. Other significant examples of time wasted?

Potential practical tasks:

configure the model to your actual needs and use file compression
make a storage/CPU-time evaluation
use nci_account to see the current status (skill set #1)
use short_files_report to see more detailed information and what your contribution is (skill set #1)
how to calculate your /short storage needs
how to transfer your model output to gdata while running the model
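
A rough way to approach the storage estimate in the tasks above is to multiply out grid size, number of variables, output frequency and bytes per value. The sketch below is illustrative only: the function name, the grid dimensions and the default compression ratio are assumptions made for the example, not values taken from any particular model.

```python
def estimate_output_bytes(nlon, nlat, nlev, nvars, steps_per_day, ndays,
                          bytes_per_value=4, compression_ratio=1.0):
    """Estimate model output size in bytes.

    compression_ratio is compressed size / uncompressed size (1.0 means no
    compression). The real ratio depends on the data, so it should be
    measured on a sample output file rather than guessed.
    """
    values_per_step = nlon * nlat * nlev * nvars
    total_values = values_per_step * steps_per_day * ndays
    return total_values * bytes_per_value * compression_ratio


if __name__ == "__main__":
    # Hypothetical run: 360x180 grid, 20 levels, 10 variables,
    # 4 output steps per day for 365 days, single precision.
    size = estimate_output_bytes(360, 180, 20, 10, 4, 365)
    print(f"{size / 1024**4:.3f} TiB uncompressed")
```

Comparing this number with the free space reported for the project gives a first indication of whether the output has to be moved to gdata while the model is still running.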

Session 2: Analysing data and /gdata

Problems:

I don't have enough space on gdata to run my analysis
I've been asked to free up space on gdata but I need more space to tar my files before archiving them
I don't have enough memory to analyse my data

Causes: (or bad usage examples)

someone who backs up their entire hard disk on gdata
keeping all the restart files and logs from previous simulations even if you're not going to analyse them
keeping output of failed simulations (just recently ~10 TB in one project)
getting your model to output variables or frequencies you will never use in your analysis
making copies of datasets
concatenating/subsetting files
keeping intermediate files when they are not needed anymore
netCDF files without internal compression

Explain the characteristics and correct use of gdata: regular checks for files to remove or archive, and accessing data in a more efficient and streamlined way so as to avoid intermediate files where possible. Document data that is shared with the wider community and move it to a more appropriate location.

Potential practical tasks:

use nci_account to check the current status (skill set #1)
use gdata_files_report to see more detailed information and what your contribution is (skill set #1) (just reviewing these, since they were covered in session 1)
optimisation: size vs number of files
nccompress
NB: cdo and other file operations might uncompress your files
using OPeNDAP?
using xarray/cdo to access multiple files?
big files need special techniques (striping and/or chunking)
dusql: can replace several shell and specific accounting tools; if your project is not a CLEx project you can use find, du, nci_account, etc.
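
For the "size vs number of files" discussion, and for projects that have to fall back on find/du style checks, a small script can show which top-level directories hold many files and which hold a lot of data. This is a minimal sketch using only the Python standard library; `summarise_tree` is a hypothetical helper written for this page, not one of the reporting tools named above, and it reports apparent file sizes rather than the disk usage that quotas are charged against.

```python
import os


def summarise_tree(root):
    """Return {top-level subdirectory: (file_count, total_bytes)} under root.

    Files directly in root are reported under the key ".".
    """
    summary = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        count, size = summary.get(top, (0, 0))
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count symbolic links
            try:
                size += os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable
            count += 1
        summary[top] = (count, size)
    return summary


if __name__ == "__main__":
    # Summarise the current directory, largest file counts first.
    results = summarise_tree(".")
    for top, (count, size) in sorted(results.items(),
                                     key=lambda item: item[1][0],
                                     reverse=True):
        print(f"{top:30s} {count:8d} files {size / 1024**3:10.2f} GiB")
```

Directories with a high file count but a small total size are the natural candidates for bundling into a tar file; directories with a few very large files are candidates for compression or chunking instead.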

Session 3: Clean, archive and mdss

Issue: a researcher leaves the centre without archiving their data. Their ex-supervisor asks us to delete the data, because they could not complete the deletion themselves: the wrong permissions were set on the data. When contacted, the researcher asks us to archive the data somewhere (too late!) because they still want to work on it. If the researcher hasn't properly documented what they did, they won't be able to reproduce the data they just lost.

Practical tasks:

think ahead: start documenting from day 1, review regularly
delete useless files: zero size, restarts, logs, compiled, intermediate steps
document for retrieval by yourself or others
tar strategies: bundle small files together, leave big files as they are, group actual data in coherent groups
tar anything you won't be using for 6 months or more
keep code separate and on GitHub or equivalent; document your code if you haven't yet
select output to publish: anything underpinning your papers and anything which can be used by others
tools: mdssdiff and mdssprep, mdss, tar, nci_account (to check allocation)
file permissions: make files group readable
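
The "bundle small files, leave big files" strategy from the list above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the behaviour of mdssprep or any other tool named here: `bundle_small_files` is a hypothetical helper, and the 100 MiB default threshold is an arbitrary example value.

```python
import os
import tarfile


def bundle_small_files(src_dir, tar_path, max_bytes=100 * 1024**2):
    """Add files smaller than max_bytes under src_dir to a tar archive.

    Returns (archived, empty, left_alone) as lists of paths relative to
    src_dir. Zero-size files are only reported, as candidates for deletion.
    Nothing is deleted: removing originals is left as a manual step once
    the archive has been checked.
    """
    archived, empty, left_alone = [], [], []
    tar_abs = os.path.abspath(tar_path)
    with tarfile.open(tar_path, "w") as tar:
        for dirpath, dirnames, filenames in os.walk(src_dir):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) == tar_abs:
                    continue  # never add the archive to itself
                rel = os.path.relpath(path, src_dir)
                size = os.path.getsize(path)
                if size == 0:
                    empty.append(rel)
                elif size < max_bytes:
                    tar.add(path, arcname=rel)
                    archived.append(rel)
                else:
                    left_alone.append(rel)
    return archived, empty, left_alone
```

The archive is written uncompressed (mode "w") on the assumption that the data files are already internally compressed; opening it with "w:gz" instead would compress the bundle. Keeping the returned lists alongside the tar file is one simple way to document what went into it.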
