Storage

 

Latest revision as of 23:32, 26 April 2022

During your work at the Centre, you are likely to produce, use and share data on different systems. You will probably have access to two different systems: your University system and NCI.

 

Storage on Gadi

On Gadi, NCI has four different filesystems: $HOME, /scratch, /g/data and mdss. mdss is a tape-based filesystem while $HOME, /scratch and /g/data are disk-based.

There are also specific filesystems linked to the Open OnDemand (OOD) system, which are used less often. On this page, we only discuss the storage available from Gadi.


Hierarchy of filesystems

The different filesystems are best thought of as a hierarchy, each with a specific use case:

  • $HOME: this is a small disk space that is backed up by NCI. It is best suited for important, hard-to-reproduce files, e.g. data analysis code.
  • /scratch: this is a temporary disk space. Files that have not been accessed for 100 days are automatically deleted. It is best suited for raw output from climate models.
  • /g/data: this is a permanent disk space. It is best suited for data that is used over a long time, e.g. climate model inputs, climate model outputs that are being analysed, climate model source code that needs to be compiled.
  • mdss: this is a tape space. It is best suited as a backup of important files or as an archiving system.

This means you need to implement a data management workflow that is compatible with the filesystem specifications, your scientific project and the management of your NCI project. We have a blog post that can help with identifying the right questions and strategies for your data.

$HOME

  • This is your home directory on Gadi.
  • Inaccessible from OOD.
  • This space is strictly limited to 10 GB for each user, but it is backed up by NCI.
  • It is most suitable for storing source code rather than model outputs or observation datasets. We still encourage you to also use Git to manage your source code and to save it to GitHub.
  • You can monitor your usage of home with the "quota" command. You may get errors at login when your home directory is full.
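
If the quota is nearly reached, it helps to know which directories take up the space. The sketch below uses only the standard find, du and sort commands. The function name largest_entries and the default of ten entries are illustrative choices, not an NCI tool.

```shell
# Print the largest entries directly under a directory (default: $HOME),
# biggest first, with sizes in kilobytes. Hidden directories are included.
largest_entries() {
    local dir="${1:-$HOME}"
    local count="${2:-10}"
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$count"
}

# Example call:
# largest_entries "$HOME" 10
```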


/scratch

Warning: this is a temporary space with automatic deletion. All users are responsible for learning about the deletion process.
  • All projects have some storage on /scratch.
  • It is the recommended filesystem for writing simulation outputs from the compute nodes.
  • Inaccessible from OOD.
  • The amount of space varies from one project to another.
  • The total space allocation is shared between all project members.
  • Managing the space is the responsibility of the members of each project.
  • There is an automated file management system in place. It will automatically remove any file that has not been accessed in the last 100 days.
  • When a project fills its quota on /scratch, the project's members will not be able to use the computing queues except for copyq queues to help with moving data around.
  • To monitor the overall usage, please use the nci_account command. To monitor the usage per user, please use nci-files-report -f scratch.
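
As a rough guide to which files are approaching the 100-day limit, access times can be checked with the standard find command. The sketch below is only an approximation. It assumes that the access time reported by the filesystem matches what NCI's automated system uses. The function name list_stale_files, the 80-day threshold and the project code ab12 in the example call are placeholders.

```shell
# List regular files under a directory whose last access is older than a
# given number of days (default 80, i.e. getting close to the 100-day limit).
# Assumption: the access time seen by find matches what NCI's system checks.
list_stale_files() {
    local dir="$1"
    local days="${2:-80}"
    find "$dir" -type f -atime +"$days" -print
}

# Example call (ab12 is a placeholder project code):
# list_stale_files "/scratch/ab12/$USER" 80
```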

Additional storage

Increasing the quota on /scratch for a project might be possible but is left to the decision of NCI staff. An extension can only be requested by the Lead CI for the project. You can check who the Lead CI is on my.nci.org.au.

 


/g/data

  • The quota on /g/data is per project; managing the usage is the sole responsibility of the project's members.
  • Accessible from OOD.
  • For compute nodes to have read and/or write access to this filesystem, you need to mount the specific area you need with the "-l storage" PBS flag. Remember: do not write the outputs of large climate simulations directly to /g/data. Only write to this space from the compute nodes in, for example, an analysis job.
  • To monitor usage, please use nci_account. To see a per-user summary, run the command nci-files-report -f gdata.
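
A minimal sketch of an analysis job that mounts both filesystems is shown below. The project code ab12, the directory names, the resource requests and the summarise_run helper are placeholders for illustration; the "-l storage" directive is the point of the example. The job reads raw output from /scratch and writes a small summary to /g/data.

```shell
#!/bin/bash
#PBS -P ab12
#PBS -q normal
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=00:30:00
#PBS -l storage=gdata/ab12+scratch/ab12

# Placeholder analysis step: write a list of the raw files and their sizes
# (in kilobytes) from a run directory into a summary file.
summarise_run() {
    local raw_dir="$1"
    local out_dir="$2"
    mkdir -p "$out_dir"
    find "$raw_dir" -type f -exec du -k {} + | sort -k2 > "$out_dir/file_sizes.txt"
}

# Only run the analysis when the script is executed by PBS,
# which sets PBS_JOBID in the job environment.
if [ -n "${PBS_JOBID:-}" ]; then
    summarise_run "/scratch/ab12/$USER/run01" "/g/data/ab12/$USER/run01_summary"
fi
```

Each area is listed as filesystem/project and several areas are joined with "+". A job that only lists scratch/ab12 would not see /g/data/ab12.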

Additional storage

A project's quota on /g/data cannot be increased with a simple request. There is a review of some of the quotas every year, at which point some projects might be granted an increase. If you need additional storage before then, please consider deleting old or incorrect data, archiving old data to massdata, using temporary storage or your University system. If you still want an increase to be considered at review time, please make sure to discuss it with the Lead CI of your project who will be part of the review.

Tape mdss

The tape system at NCI is called mdss or massdata. Please read the archiving data wiki page to learn how to use this system.

  • Tape access (writing and reading) is slow.
  • Unaccessible from OOD
  • Tape is mostly appropriate for backing up or archiving data.
  • You should only store big files on tape. If you want to migrate a lot of small files, you should first archive them together. To learn how to do that, please have a look at the "File compression and archiving" section below and email your questions to us.
  • massdata is only accessible from the login nodes (interactive) or via a script submitted to the copyq queue. It is only accessible via specific commands which are detailed in the archiving data wiki page. It is generally recommended to use the copyq queue as you then have a much longer run time.
  • Because data on massdata are likely to remain unused for a long time, it is essential to document them. For example, adding a detailed README file to your data folder can help a lot.

Use cases for the various filesystems

Keep on scratch only

  1. Producing large amounts of temporary data that will be deleted once analysis is completed
  2. Data duplicated from another site (or from massdata) used regularly as input for models or analysis, so access time is constantly updated and they can be copied again if deleted
  3. Temporary run directory for a simulation.
  4. PBS log files and other log files from a simulation or analysis

Keep on scratch but migrate when finished (/g/data, /scratch, off-site, tape)

  1. Climate model output or other data that will be analysed and archived in a short time frame, within the 100 day expiry time limit, taking into account that accessing the files will reset the expiry time
  2. Figures in some cases, e.g. if you are producing a very large number of figures and will only keep a few for the long term.

Create initially on scratch but migrate to /g/data automatically and delete from scratch

  1. Long climate model runs that might take longer than the 100 day expiry date to complete, where there is no possibility to analyse the data before the run is complete
  2. Shared climate model runs that may well be accessed by a wide variety of people but whose access patterns are not predictable, e.g. CMIP, CORDEX, COSIMA outputs
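
One way to script the "copy, check, then delete" step is sketched below. This is an illustration under assumptions rather than a CMS-provided tool: migrate_run is a hypothetical helper and the paths in the example call use the placeholder project ab12. The source directory is removed only if a recursive comparison finds no differences between source and destination.

```shell
# Copy a run directory to a destination, verify the copy, then delete the
# source. The source is left untouched if the copy or the check fails.
migrate_run() {
    local src="$1"
    local dest="$2"
    mkdir -p "$dest" || return 1
    # -a preserves timestamps and permissions; "$src/." copies the contents.
    cp -a "$src/." "$dest/" || return 1
    # Compare source and destination file by file before deleting anything.
    diff -r "$src" "$dest" > /dev/null || return 1
    rm -rf "$src"
}

# Example call (placeholder paths):
# migrate_run "/scratch/ab12/$USER/run01" "/g/data/ab12/$USER/run01"
```

Because the check compares the two trees in both directions, the helper also refuses to delete the source when the destination already holds extra files, which is the cautious choice.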

Create in /home or /g/data directly

  1. Any code: your own analysis code, climate model source code
  2. External packages installed locally
  3. DO NOT RUN CLIMATE MODELS AND OUTPUT DIRECTLY TO EITHER OF THESE LOCATIONS. This is an anti-pattern: it performs poorly, and NCI will become quite annoyed, as /scratch is designed for this use case and /g/data and /home are not.


Temporary storage

The CMS also manages two projects on NCI that can be used for temporary storage of data. Both are mounted on /g/data and have the same characteristics as other /g/data storage space, as explained above.

The temporary storage projects are:

  • /g/data/hh5: this project is for short temporary use (~3 months). It could be used, for example, to hold your raw model outputs while you save a subset or a reformatted version to your project's space and move the raw outputs to massdata for safekeeping.
  • /g/data/ua8: the main purpose of this project is to store replicas of datasets which do not have a specific data project assigned. However, the free space in this project can be used as temporary storage for data that is being processed for publication.

Request access

To use any space on these projects, you need to:

  • request connection to the project if you are not yet a member. You can check which projects you are part of with the "groups" command.
  • fill in a storage request using the CLEX DMPonline tool. If you do not yet have an account on this tool, please be patient while the account is created: to avoid robots and unauthorised access, account creation requires human verification on our end. Please email us if you have any questions about filling in the form. Note that this form is principally to enable us to monitor the space used, requested and available. It also enables us to prepare a folder for you with appropriate permissions. The forms are very short and quick to fill in, and the storage is usually ready for use within a few hours. See this page for more detailed instructions on how to fill in the form. Note:
    • you do not need to email us to request storage; please just fill in the form to request the allocation you would like.
    • you can also request space for use by a whole group instead of per user, but all users of the group must request connection to the project.
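
The membership check can also be scripted. In this sketch, is_member is a hypothetical helper built on the standard id command, which reports the same group names as "groups" for the current user.

```shell
# Return success if the current user belongs to the named group (project).
is_member() {
    id -Gn | tr ' ' '\n' | grep -qx -- "$1"
}

# Check both temporary storage projects.
for project in hh5 ua8; do
    if is_member "$project"; then
        echo "$project: already a member"
    else
        echo "$project: not a member yet, request connection first"
    fi
done
```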

Storage at Universities

at ANU

at Monash

at UMelb

at UNSW

at UTAS


File compression and archiving

See our Archiving Data page for more details. Below are a few rules and pointers which are important to know before starting any work.

For an efficient use of storage, there are a few rules to keep in mind:

  • It is more efficient to store a few larger files than lots of smaller files. It is hard to define large and small, but files of several tens of gigabytes are absolutely acceptable. The size of the files should also take into account how you or others are going to use them. Files nearing 100 GB become unmanageable and should be produced only if there is no other option.
  • It is always best practice to compress your data when possible. NetCDF files are now easily compressible; see this wiki page for detailed explanations of the tools available at NCI.

To store small setup files that define your experiments, think about using the "tar" command. Here is a cheat sheet for tar. This is a shell command with a manual accessible through

man tar

This command saves many files together in a single archive. It can be used on a directory tree and will restore the directory structure when the files are restored from the archive. This means that if you need to save the setup of several experiments, the best way might be to create a directory tree containing the setup files of all the experiments and then create a single archive file for all of them. The archive files can also easily be compressed/uncompressed using the gzip utility, either at archive creation time or afterwards.
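
As an illustration, the sketch below bundles a directory tree of setup files into one gzip-compressed archive, lists its contents and restores it elsewhere. The directory and file names (experiments, exp01, exp02, config.nml) are invented for the example; the tar options used (-c, -z, -f, -t, -x, -C) are standard.

```shell
# Work in a throwaway directory so the example does not touch real data.
workdir=$(mktemp -d)
mkdir -p "$workdir/experiments/exp01" "$workdir/experiments/exp02"
echo "dt = 1800" > "$workdir/experiments/exp01/config.nml"
echo "dt = 900"  > "$workdir/experiments/exp02/config.nml"

# Create a single gzip-compressed archive of the whole tree.
# -C changes into $workdir first so the archive stores relative paths.
tar -czf "$workdir/experiments_setup.tar.gz" -C "$workdir" experiments

# List the archive contents without extracting.
tar -tzf "$workdir/experiments_setup.tar.gz"

# Restore the tree, with its directory structure, into another location.
mkdir -p "$workdir/restored"
tar -xzf "$workdir/experiments_setup.tar.gz" -C "$workdir/restored"
```

Compressing at creation time with -z avoids a separate gzip step. An existing uncompressed .tar file can still be compressed afterwards with gzip.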