Template for storage training workflow
Reminder: the idea is to divide the training into different sessions, each session covering a part of a typical workflow.
For each session we should start by presenting one or more issues we encounter often, together with their typical causes and consequences. After this we should cover what the right approach is and why. Finally, we run through some practical examples using the commands listed in the skill sets.
Session 1: Running a model and /short
- short is full or nearly full and I don't have space to store my output
- my simulation stopped because it produces more output than can be stored on /short
- I have exceeded the inode quota
Causes: (or bad usage examples)
- someone is using short to run the analysis on their model output
- someone ran a huge simulation and didn't set it up so that the output is transferred to gdata regularly
- someone just left their data there forever
- you didn't estimate your model output correctly or at all
We could use a snapshot showing how nearly every week there's a warning that a project is over 90% or full, and explain what happens when the quota is exceeded. Explain how short is meant to work, why we can't just increase it, and how to use it correctly. Other significant examples of time wasted??
Potential practical tasks:
- use nci_account to see the current status (skill set #1)
- use short_files_report to see more detailed information and what your contribution is (skill set #1)
- how to calculate your short storage needs
- how to transfer your model output to gdata while running the model
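For the storage-needs task, a back-of-the-envelope calculation may be enough to start the discussion. The sketch below is only an illustration: the grid size, number of variables, output frequency and quota are made-up values, and it assumes uncompressed output with one value per grid point, variable and time step. The 90% threshold mirrors the weekly warning level mentioned above.

```python
TB = 1024 ** 4  # bytes in one tebibyte


def estimate_output_bytes(nx, ny, nz, n_vars, n_steps, bytes_per_value=4):
    """Uncompressed size of gridded output.

    Assumes every variable is written on the full nx * ny * nz grid at
    every output step, as 4-byte floats unless stated otherwise.
    """
    return nx * ny * nz * n_vars * n_steps * bytes_per_value


def fits_on_short(needed_bytes, quota_bytes, used_bytes, warning_level=0.9):
    """True if the run keeps the project below the warning level of its quota."""
    return used_bytes + needed_bytes <= warning_level * quota_bytes


# Hypothetical run: 1-degree global grid, 50 levels, 20 variables,
# daily output for 10 years (365-day calendar).
needed = estimate_output_bytes(nx=360, ny=180, nz=50, n_vars=20, n_steps=365 * 10)

# Hypothetical project: 2 TB quota on short, 1.5 TB already in use.
ok = fits_on_short(needed, quota_bytes=2 * TB, used_bytes=3 * TB // 2)

print(f"estimated output: {needed / TB:.2f} TB, fits below warning level: {ok}")
```

If the estimate does not fit, that is the cue to reduce the variables or output frequency, or to plan regular transfers to gdata during the run.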
Session 2: Analysing data and /gdata
- I don't have enough space on gdata to run my analysis
- I've been asked to free up space on gdata but I need more to tar my files before archiving them
Causes: (or bad usage examples)
- someone who backs up their entire hard disk on gdata
- keeping all the restart files and logs from previous simulations even if you're not going to analyse them
- keeping output of failed simulations (just recently ~10 TB in one project)
- getting your model to output variables or frequencies you will never use in your analysis
- making copies of datasets
- concatenating/subsetting files
- keeping intermediate files when they are not needed anymore
Explain the characteristics and correct use of gdata, regular checks for files to remove/archive, and accessing data in a more efficient and streamlined way so as to avoid intermediate files where possible. Documenting data that is shared with the wider community and moving it to a more appropriate location.
Potential practical tasks:
- use nci_account to check the current status (skill set #1)
- use gdata_files_report to see more detailed information and what your contribution is (skill set #1)
- using xarray/cdo to access multiple files?
- using opendap???
- only a brief review of nci_account and the files report, since they were covered in session 1
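For the regular checks for files to remove or archive, a practical task could use a small script like the one below. It is a sketch only and not a replacement for gdata_files_report: it walks a directory, adds up apparent file sizes, counts files (each file uses an inode), and lists files that have not been modified for a given number of days. The function name, the 180-day threshold and the example path are all assumptions made for illustration.

```python
import os
import time


def summarise_tree(root, older_than_days=180, now=None):
    """Return (total_bytes, file_count, stale_files) for everything under root.

    stale_files is a list of (path, size) for files whose modification time
    is older than older_than_days, largest first. Sizes are apparent file
    sizes, not blocks actually used on disk.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    total_bytes = 0
    file_count = 0
    stale_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_bytes += info.st_size
            file_count += 1
            if info.st_mtime < cutoff:
                stale_files.append((path, info.st_size))
    stale_files.sort(key=lambda item: item[1], reverse=True)
    return total_bytes, file_count, stale_files


if __name__ == "__main__":
    # Hypothetical location; replace with your own directory on gdata.
    total, count, stale = summarise_tree("/path/to/your/gdata/directory")
    print(f"{count} files, {total / 1024 ** 3:.1f} GiB in total")
    for path, size in stale[:10]:
        print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
```

The largest stale files are the natural first candidates to discuss: delete, tar and archive, or keep with a documented reason.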
Session 3: Clean, archive and mdss
issue: a researcher leaves the center without archiving their data. Their ex-supervisor asks us to delete the data, having been unable to delete it directly because the wrong permissions are set on it. When contacted, the researcher asks us to archive the data somewhere (too late!) because they still want to work on it. If the researcher hasn't properly documented what they did, they won't be able to reproduce the data they just lost.