Template for storage training workflow
Reminder: the idea is to divide the training into different sessions, each session covering a part of a typical workflow.
For each session we should start by presenting one or more issues we encounter often, together with their typical causes and consequences. After this we should cover what the right approach is and why. Finally, we run through some practical examples using the commands listed in the skill sets. Ask participants why they think something is an issue, or what they think caused it. Video on bad filesystem use?
Session 1: Running a model and /short
Intro to entire training
- different filesystems - (pre-requisite)
- short is full or nearly full and I don't have space to store my output
- my simulation stopped because it produced more output than can be stored on /short
- exceeded the inode quota
- someone is using short to run the analysis on their model output
- someone ran a huge simulation and didn't set it up so that the output is transferred to gdata regularly
- someone just left their data there forever
- you didn't estimate your model output correctly, or at all
- logs and compiled files are small, but many small files degrade filesystem performance
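To make the inode point concrete in the session, a minimal Python sketch could count files and bytes per top-level directory. This is an illustration only, not one of the NCI tools: the function name `usage_by_subdir` is hypothetical, and it counts files only (directories also consume inodes), so treat the result as a rough proxy.

```python
import os
from collections import defaultdict


def usage_by_subdir(root):
    """Return {top-level entry: (file_count, total_bytes)} for files under root.

    Hypothetical helper for illustration; files directly in root are
    reported under the key ".".
    """
    usage = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            usage[top][0] += 1
            usage[top][1] += size
    return {top: tuple(counts) for top, counts in usage.items()}
```

A directory with a high file count but a small total size (typically logs or compiled objects) is a good candidate for cleaning up or tarring.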
We could use a snapshot showing how nearly every week there's a warning that a project is over 90% or full, and explain what happens when the quota is exceeded. Explain how short is meant to work, why we can't just increase it, and its correct use. Other significant examples of time wasted?
Potential practical tasks:
- configure model to your actual needs and use file compression
- make a storage/CPU-time evaluation
- use nci_account to see what is the current status (skill set #1)
- use short_files_report to see more detailed information and what your contribution is (skill set #1)
- how to calculate your short storage needs
- how to transfer your model output to gdata while running the model
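The storage calculation in the tasks above can be sketched as a back-of-envelope estimate. This assumes regular gridded output with a fixed number of variables and write frequency; all function and parameter names are hypothetical, and the compression ratio is something to measure on real output rather than guess.

```python
def estimate_output_bytes(nx, ny, nz, n_vars, writes_per_day, n_days,
                          bytes_per_value=4, compression_ratio=1.0):
    """Rough size of model output in bytes.

    Assumes every variable is written on the full nx * ny * nz grid at
    every write. bytes_per_value=4 corresponds to single-precision floats;
    compression_ratio=1.0 means no compression.
    """
    n_values = nx * ny * nz * n_vars * writes_per_day * n_days
    return n_values * bytes_per_value / compression_ratio


def simulated_days_that_fit(free_bytes, bytes_per_simulated_day):
    """Whole simulated days of output that fit in the free space."""
    return int(free_bytes // bytes_per_simulated_day)


# Illustrative numbers only: a 360 x 180 x 50 grid, 10 variables,
# 4 writes per day, one simulated year, uncompressed 4-byte floats.
total = estimate_output_bytes(360, 180, 50, 10, 4, 365)
per_day = estimate_output_bytes(360, 180, 50, 10, 4, 1)
print(f"one year: {total / 1e9:.1f} GB, one day: {per_day / 1e6:.1f} MB")
```

Comparing the per-day figure with the free space on /short gives an upper bound on how long the model can run before output has to be moved to gdata.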
Session 2: Analysing data and /gdata
- I don't have enough space on gdata to run my analysis
- I've been asked to free up space on gdata but I need more space to tar my files before archiving them
- I don't have enough memory to analyse my data
Causes: (or bad usage examples)
- someone who backs up their entire hard disk on gdata
- keeping all the restart files and logs from previous simulations even if you're not going to analyse them
- keeping output of failed simulations (just recently ~10 TB in one project)
- getting your model to output variables or frequencies you will never use in your analysis
- making copies of datasets
- concatenating/subsetting files
- keeping intermediate files when they are not needed anymore
- NetCDF files without internal compression
Explain the characteristics and correct use of gdata, regular checks for files to remove or archive, and accessing data in a more efficient and streamlined way so as to avoid intermediate files where possible. Document data that is shared with the wider community and move it to a more appropriate location.
Potential practical tasks:
- using nci_account to check current status (skill set #1)
- use gdata_files_report to see more detailed information and what your contribution is (skill set #1) (just reviewing these since they were covered in session 1)
- optimisation: size vs number of files
- NB: cdo or other file operations might uncompress your files
- using opendap???
- using xarray/cdo to access multiple files?
- big files need special techniques (striping and/or chunking)
- dusql: can replace several shell and specific accounting tools; if your project is not a CLEx one you can use find, du, nci_account, etc.
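For projects without dusql, the regular check for files to remove or archive could be demonstrated with a short standard-library sketch like the one below. It is an illustration, not a replacement for the accounting tools: the function name `stale_files` is hypothetical, and it uses modification time because access times are not always reliable on shared filesystems (an assumption to verify for gdata).

```python
import os
import time


def stale_files(root, min_age_days, now=None):
    """List (size_bytes, path) for files not modified in min_age_days.

    Results are sorted largest first, so the best candidates for
    archiving or deletion appear at the top. Hypothetical helper.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds in a day
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable or vanished; skip it
            if st.st_mtime < cutoff:
                found.append((st.st_size, path))
    return sorted(found, reverse=True)
```

The sketch only reports files; deciding what to delete or archive stays with the data owner.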
Session 3: Clean, archive and mdss
Issue: a researcher leaves the centre without archiving their data. Their ex-supervisor asks us to delete the data, having been unable to complete the deletion because of wrong permissions set on the data. When contacted, the researcher asks us to archive the data somewhere (too late!) because they still want to work on it. If the researcher hasn't properly documented what they did, they won't be able to reproduce the data they just lost.
- think ahead: start documenting from day 1, review regularly
- delete useless files: zero size, restarts, logs, compiled, intermediate steps
- document for retrieval by yourself or others
- tar strategies: tar small files together, leave big files as they are, group actual data into coherent groups
- tar anything you won't be using for 6 months or more
- keep code separate and on github or equivalent, document your code if you haven't yet
- select output to publish: anything underpinning your papers and anything which can be used by others
- tools: mdssdiff and mdssprep, mdss, tar, nci_account (to check allocation)
- file permissions: make files group readable
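The tar strategy listed above (bundle small files, leave big files alone) can be sketched with Python's `tarfile` module. This is a teaching sketch under stated assumptions: the function name `tar_small_files` and the 100 MiB default threshold are hypothetical choices, and the originals are never deleted, so the result can be checked before anything is removed or sent to mdss.

```python
import os
import tarfile


def tar_small_files(src_dir, tar_path, max_bytes=100 * 1024**2):
    """Bundle files smaller than max_bytes into an uncompressed tar.

    Returns the list of files that were left out because they are large
    enough to archive individually. The threshold is an assumption.
    """
    left_out = []
    tar_abs = os.path.abspath(tar_path)
    with tarfile.open(tar_path, "w") as tar:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) == tar_abs:
                    continue  # never add the archive to itself
                if os.lstat(path).st_size < max_bytes:
                    # store paths relative to src_dir so the tar unpacks cleanly
                    tar.add(path, arcname=os.path.relpath(path, src_dir))
                else:
                    left_out.append(path)
    return left_out
```

Listing the archive contents with `tar -tf` and comparing against the source directory is a sensible check before deleting the originals.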