Difference between revisions of "Storage"


Revision as of 01:24, 12 April 2021


During your work at the Centre, you are likely to produce, use and share data on different systems. You will probably have access to two different systems: your University system and NCI.

Storage at NCI


NCI provides two types of storage: tape and disk. Tape is for long-term storage, while disk is better suited to data you need to access often.

Tape at NCI


The tape system at NCI is called /massdata. Please read the Users' guide to learn how to use this system. Here are a few important points to keep in mind:

  • Tape is mostly appropriate for archiving data.
  • You should only store big files on tape. If you want to migrate a lot of small files, you should first archive them together. To learn how to do that, please have a look at the Users' guide and email us your questions.
  • /massdata is only accessible from the login nodes (interactively) or via a script submitted to the copyq queue. The copyq queue is generally recommended because it allows a much longer run time.
  • Tape access (writing and reading) is slow.
  • Data on /massdata are likely to go unused for a long time, so it is essential to document them. For example, adding a detailed README file to your data folder can help a lot.
  • Because of the size of /massdata, its quota usage figures are only updated once a day, overnight. It is therefore recommended to clean up promptly, if possible before breaching the quota.
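
As a rough aid for the "big files only" rule above, the following sketch counts how many files in a directory fall below a size threshold, which can help decide whether they should be archived together before going to tape. The function name count_small_files and the 1000-byte threshold are illustrative assumptions, not NCI tools or NCI limits, and file names containing newlines are not handled.

```shell
# count_small_files DIR BYTES: print how many regular files under DIR
# are smaller than BYTES bytes. The threshold is an arbitrary choice
# made for this example, not an NCI rule.
count_small_files() {
    find "$1" -type f -size "-${2}c" | wc -l | tr -d ' '
}

# Demonstration on a throwaway directory: three tiny files, one larger one.
demo=$(mktemp -d)
printf 'x\n' > "$demo/a.txt"
printf 'x\n' > "$demo/b.txt"
printf 'x\n' > "$demo/c.txt"
head -c 2000 /dev/zero > "$demo/big.dat"

count_small_files "$demo" 1000   # prints 3
```

A large count is a hint that the directory is better bundled into a single archive file before it is moved to tape.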

Additional storage

It might be possible to increase the quota on /massdata for your project. To do so, please send us an email detailing how much additional space you would like and which NCI project it is for. Again, because this is tape storage, the request might take some days to process, so please plan ahead.

Disk at NCI


There are three different disk filesystems at NCI, each with a slightly different purpose. NCI also has a Users' Guide. All the disk filesystems are accessible from the login nodes and the compute nodes, so you can read from or write to one filesystem while working in another. All of them can also exchange data with /massdata, either interactively from the login nodes or through a script submitted to the copyq queue.

/home

  • This is your home directory.
  • This space is strictly limited at 2GB for each user but it is backed up.
  • It is most suitable for storing source code rather than model outputs or observation datasets.
  • You can monitor your use on home with the "lquota"[1] command.

/short

  • All projects have some storage on /short.
  • The amount of space varies from one project to another.
  • Managing the space is the responsibility of the members of each project, although automatic emails are sent to the project members when usage approaches the quota.
  • When a project fills its quota on /short, the project's members cannot use the compute queues. Only the copyq queue remains available, to help with moving data around.
  • To monitor the overall usage, please use the "nci_account"[1] command. To monitor the usage per user, please use "short_files_report -P $PROJECT"[1].
  • Increasing the quota on /short for a project might be possible, but the decision rests with NCI staff. If you would like to request an extension, please send us an email.

/g/data

  • Most projects now have some storage on /g/data1 or /g/data3.
  • As for /short, the quota on /g/data is per project, and managing the usage is the sole responsibility of the project's members.
  • /g/data can be less stable than /short. It is therefore recommended to use the special PBS resource: #PBS -lother=gdata. Your job will then only start when the /g/data filesystem is accessible.
  • Compute nodes have both read and write access to this filesystem.
  • Your project's quota on /g/data is not extensible by simple email. There is a review of some of the quotas every year, at which point some projects might be granted an increase. If you need additional storage before then, please consider deleting old or incorrect data, archiving old data to /massdata, using temporary storage or your University system. If you still want an increase to be considered at review time, please make sure to discuss it with the Lead CI of your project who will be part of the review.
  • To monitor usage, please use "nci_account"[1]. To see a per-user summary, run the command "gdata1_files_report -P $PROJECT".

Temporary storage


The CMS also manages two projects at NCI that can be used for temporary storage of data. Both are mounted on /g/data and have the same characteristics as other /g/data storage space, as explained above.

To use any space on these projects, you need to:

  • request connection to the project if you are not yet a member. You can check which projects you are part of with the "groups" command.
  • fill in a storage request using the DMPonline tool. If you do not yet have an account on this tool, please be patient while the account is created: to keep out robots and unauthorised access, account creation requires human verification on our end. Please email us if you have any questions about filling in the form. Note that this form is principally there to let us monitor the space used, requested and available. It also enables us to prepare a folder for you with appropriate permissions. The forms are very short and quick to fill in, and the storage is usually ready for use within a few hours. See this page for more detailed instructions on how to fill in the form. Note:
    • you do not need to email us to request space; please just fill in the form to request the allocation you would like.
    • also, you can request space for use by a whole group instead of per user, but all users of the group must request connection to the project.
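
The membership check mentioned in the first step can be scripted. The sketch below assumes that the Unix group names reported by "groups" match the project codes (hh5 and ua8 here, taken from the paths listed in this section); the helper name has_group is hypothetical.

```shell
# has_group LIST NAME: succeed if NAME is one of the space-separated
# group names in LIST (for example, the output of the "groups" command).
has_group() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | grep -qxF -- "$2"
}

# Report membership of the two temporary storage projects.
# Assumption: the group name is the same as the project code.
for project in hh5 ua8; do
    if has_group "$(groups)" "$project"; then
        echo "already a member of $project"
    else
        echo "not a member of $project: request connection first"
    fi
done
```

Passing the group list in as an argument, rather than calling "groups" inside the function, keeps the helper easy to check against a known list.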

The temporary storage projects are:

  • /g/data3/hh5: this project is for short temporary use (~3 months). It could be used, for example, to write your raw model outputs; you would then save a subset or a reformatted version to your project's space and move the raw outputs to /massdata for safekeeping.
  • /g/data1/ua8: the main purpose of this project is to store published datasets created by the Centre's staff. For example, some journals now request that researchers publish their data in parallel to their papers. It is also used for small downloaded datasets that are shared across the CoE and do not have their own "project". However, the free space in this project can be used as temporary storage for data that is being processed for publication.

Storage at Universities


at ANU

at Monash

at UMelb

at UNSW

at UTas

File compression and archiving


For an efficient use of storage, there are a few rules to keep in mind:

  • It is more efficient to store a few larger files than lots of smaller files. It is hard to define large and small, but files of several tens of gigabytes are perfectly acceptable. The size of the files should also take into account how you or others are going to use them. Files nearing 100 GB become unmanageable and should be produced only if there is no other option. In that case you should read about ...
  • It is always best practice to compress your data when possible. NetCDF files are now easily compressible; see this article for detailed explanations of the tools available at NCI.

To store small setup files that define your experiments, think about using the "tar" command. This is a shell command with a manual accessible through

man tar

This command saves many files together in a single archive. It can be used on a directory tree and will restore the directory structure when restoring the files from the archive. This means that if you need to save the setup of several experiments, the best way might be to create a directory tree containing the setup files of all the experiments, then create one single archive file for all of them. The archive files can also easily be compressed/uncompressed using the gzip utility, either at archive creation time or afterwards.

[1]: see the 2017 training material for usage of these commands and acls and tar cheat sheets.
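
The tar workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the directory layout, the file names and the archive name setups.tar.gz are all made up for the example, and everything happens in a temporary directory.

```shell
set -eu

# Build a small, made-up tree of experiment setup files.
work=$(mktemp -d)
mkdir -p "$work/experiments/exp1" "$work/experiments/exp2"
echo "dt = 600" > "$work/experiments/exp1/namelist.txt"
echo "dt = 300" > "$work/experiments/exp2/namelist.txt"

# Create one archive for the whole tree; -z compresses it with gzip
# at creation time, and -C makes the stored paths relative to $work.
tar -czf "$work/setups.tar.gz" -C "$work" experiments

# List the archive contents without extracting anything.
tar -tzf "$work/setups.tar.gz"

# Restore into another directory; the directory structure comes back.
mkdir "$work/restore"
tar -xzf "$work/setups.tar.gz" -C "$work/restore"
```

Using -C keeps absolute paths out of the archive, so it can be restored anywhere. An uncompressed archive created with "tar -cf" can still be compressed afterwards with gzip.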

For more details see Archiving Data