NCI Guidelines: gdata
Gdata is disk storage with high-speed access, meant for data files used in analysis. This disk is not backed up. The location is always /g/dataN, with N identifying physically different disks; all of the Centre’s projects are on /g/data1 or /g/data3. These are effectively separate filesystems, so if you move files from a project on /g/data1 to a project on /g/data3, you are effectively copying them to a new filesystem. Moving files between two projects on the same disk is essentially immediate, because their physical location does not change.
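A minimal sketch of the difference described above, using temporary directories as stand-ins for project spaces (the /g/data paths in the comments are hypothetical examples):

```shell
#!/bin/sh
set -e
src=$(mktemp -d)   # stands in for a directory under /g/data1/projA
dst=$(mktemp -d)   # stands in for a directory under /g/data1/projB
echo "model output" > "$src/output.nc"
# Same physical disk: mv is a rename, effectively instant regardless of file size.
mv "$src/output.nc" "$dst/output.nc"
ls "$dst/output.nc"
# Across disks (e.g. /g/data1 -> /g/data3) every byte is copied; for large
# transfers rsync lets an interrupted copy resume, e.g.:
#   rsync -av /g/data1/projA/run1/ /g/data3/projB/run1/
rm -rf "$src" "$dst"
```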
Gdata is used for large collections that users access often, so it is appropriate for:
- Published or shared datasets: the data projects such as ub4 (ERA-Interim), rr7 (Reanalysis products), rr3, al33 and oi10 (CMIP), and ua8 (CoE Published Data) have disk storage allocated because they are large and frequently accessed by many users. Files served through web services via THREDDS also need to be stored on disk with fast access.
- Model output or other collections of files that a user is currently analysing, and any intermediate files produced by that analysis.
- Software not otherwise available, although it is always best to check with the CMS and/or NCI whether a shared installation is possible.
What should not be stored on gdata:
- Your own code files: gdata is not backed up, so it is better to keep them in your home directory. You can have a local copy on gdata, but use Bitbucket or GitHub as your original repository.
- Model logs and other standard output/error files produced by software, unless you are analysing them. If you want to keep these files for future reference, archive (tar) them and move them to tape (MDSS).
- Files that are no longer used: once you have finished analysing your data, you should clean up and archive it. If you think you might reuse the data in the future, evaluate how likely that is and act accordingly. It is very easy to leave data there “for the moment” and never get back to it. Storage is becoming more and more scarce, and gdata is by far the most expensive storage option.
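The log-archiving step above can be sketched as follows; the project code xy11 and the mdss destination are hypothetical examples, and the mdss commands only work on NCI login nodes:

```shell
#!/bin/sh
set -e
# Stand-in for a directory of real model logs:
mkdir -p logs && echo "step 1 ok" > logs/run.log
# Bundle the many small log files into one compressed archive:
tar -czf run_logs.tar.gz logs/
# On an NCI login node, push the archive to tape and free the gdata space:
#   mdss -P xy11 put run_logs.tar.gz
#   rm -r logs run_logs.tar.gz
```

Archiving before the transfer matters: tape systems handle one large file far better than thousands of small ones.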
Good practice on gdata:
- All netCDF files should be compressed
- Use group permissions wherever you can
- Organise your work space (READMEs, data management plan, etc).
- Clean up regularly
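A minimal sketch of group permissions on a shared directory; the project group xy11 is a hypothetical example (a temporary directory stands in for a gdata path):

```shell
#!/bin/sh
set -e
d=$(mktemp -d)          # stands in for e.g. /g/data1/xy11/shared
# chgrp xy11 "$d"       # hand the directory to the project group
chmod g+rxs "$d"        # group members can enter; setgid makes new files inherit the group
touch "$d/analysis.nc"
chmod g+r "$d/analysis.nc"   # colleagues can read the file, but not modify it
ls -ld "$d"
rm -rf "$d"
```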
- A “big” file here means anything larger than a few GB.
- Avoid creating any file bigger than 20 GB. If you really need to, use techniques such as:
- Chunking (netcdf4): http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
- Striping: https://opus.nci.org.au/display/Help/Lustre+Basics
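The two techniques above can be sketched as command invocations; this assumes the netCDF utilities and Lustre client tools are installed, and the dimension names, chunk sizes, stripe count, and file paths are all illustrative examples, not recommendations:

```shell
# Chunking: rewrite a netCDF file so reads along the time axis touch few
# chunks (also compress it with deflate level 5 while we are at it):
nccopy -c time/1,lat/128,lon/128 -d 5 input.nc chunked.nc

# Striping: spread a large file across 4 Lustre OSTs before writing it,
# so parallel reads/writes are not bottlenecked on a single storage target:
lfs setstripe -c 4 /g/data1/xy11/user/bigfile.nc
```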