Storage-workflow

Revision as of 00:12, 15 March 2019

Template for storage training workflow

Reminder: the idea is to divide the training into different sessions, each session covering a part of a typical workflow.

For each session we should start by presenting one or more issues we encounter often, along with their typical causes and consequences. After this we should cover what the right approach is and why. Finally, we run through some practical examples using the commands listed in the skill sets.

Session 1: Running a model and /short

Problems

  • /short is full or nearly full and I don't have space to store my output
  • my simulation stopped because it produces more output than can be stored on /short
  • I have exceeded my inode quota

Causes:

  • someone is using /short to run the analysis on their model output
  • someone ran a huge simulation and didn't set it up to transfer the output to gdata regularly
  • someone just left their data there forever
  • you didn't estimate your model output correctly, or at all

We could use a snapshot showing how nearly every week there's a warning that a project is over 90% or full, and explain what happens when the quota is exceeded. Explain how /short is meant to work, why we can't just increase it, and its correct use. Other significant examples of time wasted??

Potential practical tasks:

  • use nci_account to see the current status (skill set #1)
  • use short_files_report to see more detailed information and what your contribution is (skill set #1)
  • how to calculate your /short storage needs
  • how to transfer your model output to gdata while running the model
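The storage-needs calculation above could be sketched like this; all the numbers (file size, output frequency, run length) and the project code in the comment are made-up placeholders, not real quotas or paths:

```python
# Back-of-envelope estimate of model output volume.
# Placeholder numbers: 2 MB per output file, 8 files per model day, a 50-year run.
mb_per_file = 2
files_per_day = 8
days = 50 * 365

total_mb = mb_per_file * files_per_day * days
total_gb = total_mb // 1024
print(f"Estimated output: {total_gb} GB")  # compare this against the /short quota

# To move finished output off /short as the run progresses, something like
# (project code w35 and paths are hypothetical):
#   rsync -av --remove-source-files /short/w35/$USER/run01/output/ /g/data/w35/$USER/run01/output/
```

Doing this estimate before the run starts is what lets you decide whether output needs to be staged to gdata during the simulation rather than after it.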

Session 2: Analysing data and /gdata

Problems:

  • I don't have enough space on gdata to run my analysis 
  • I've been asked to free up space on gdata but I need more to tar my files before archiving them

Causes: (or bad usage examples)

  • someone backing up their entire hard disk on gdata
  • keeping all the restart files and logs from previous simulations even if you're not going to analyse them
  • keeping output of failed simulations (just recently ~10 TB in one project)
  • getting your model to output variables or frequencies you will never use in your analysis
  • making copies of datasets
  • concatenating/subsetting files
  • keeping intermediate files when they are not needed anymore

Explain the characteristics and correct use of gdata, regular checks for files to remove/archive, and accessing data in a more efficient and streamlined way so as to avoid intermediate files where possible. Also cover documenting data that is shared with the wider community and moving it to a more appropriate location.

Potential practical tasks:

  • use nci_account to check the current status (skill set #1)
  • use gdata_files_report to see more detailed information and what your contribution is (skill set #1)
  • using xarray/cdo to access multiple files?
  • using opendap???
  • just reviewing these since they were covered in session 1
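A minimal sketch of the xarray multi-file idea, using toy in-memory datasets (in a real session these would come from something like `xr.open_mfdataset("/g/data/<project>/run01/tas_*.nc")` — that path is hypothetical). The point is that concatenation and subsetting happen in memory, so no intermediate merged or subset file is ever written to gdata:

```python
import numpy as np
import xarray as xr

# Two toy "files" worth of a variable tas along a time axis
ds1 = xr.Dataset({"tas": ("time", np.arange(3.0))}, coords={"time": [0, 1, 2]})
ds2 = xr.Dataset({"tas": ("time", np.arange(3.0, 6.0))}, coords={"time": [3, 4, 5]})

# Concatenate along time in memory -- no intermediate merged file on disk
ds = xr.concat([ds1, ds2], dim="time")

# Subset by label directly on the combined dataset; only this slice is used
print(float(ds["tas"].sel(time=slice(1, 4)).mean()))
```

The same pattern applies to opendap: `xr.open_dataset(<opendap URL>)` reads remote data lazily, so subsetting happens before anything is downloaded, which is exactly the "avoid intermediate files" behaviour this session argues for.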

Session 3: Clean, archive and mdss