<span style="font-size:medium">During your work at the Centre, you are likely to produce, use and share data on different systems. You will probably have access to two different systems: your University system and NCI.</span>
  
= Storage on Gadi =
<span style="font-size:medium">On Gadi, NCI has four different filesystems: $HOME, /scratch, /g/data and mdss. mdss is a tape-based filesystem while $HOME, /scratch and /g/data are disk-based.</span>

<span style="font-size:medium">There are also specific filesystems linked to the Open OnDemand (OOD) system, which are less used. On this page, we only discuss the storage available from Gadi.</span>

== Hierarchy of filesystems ==

<span style="font-size:medium">The different filesystems are best thought of as a hierarchy, each with a specific use case:</span>
*<span style="font-size:medium"><span style="color:#8e44ad">$HOME</span>: a small disk space that is backed up by NCI. It is best suited to important, hard-to-reproduce files, e.g. data analysis code.</span>
*<span style="font-size:medium"><span style="color:#8e44ad">/scratch</span>: a temporary disk space. Files are automatically deleted after a period without access. It is best suited to raw output from climate models.</span>
*<span style="font-size:medium"><span style="color:#8e44ad">/g/data</span>: a permanent disk space. It is best suited to data used over a long time, e.g. climate model inputs, climate model outputs being analysed, and climate model source code that needs to be compiled.</span>
*<span style="font-size:medium"><span style="color:#8e44ad">mdss</span>: a tape space. It is best suited to backing up important files or archiving.</span>

<span style="font-size:medium">This means you need to implement a data management workflow that is compatible with the filesystems' specifications, your scientific project and the management of your NCI project. We have [https://climate-cms.org/posts/2022-04-26-storage-where-what-why-how.html a blog post] that can help with identifying the right questions and strategies for your data.</span>
=== <span style="font-size:x-large">$HOME</span> ===

*<span style="font-size:medium">This is your home directory on Gadi.</span>
*<span style="font-size:medium">Inaccessible from OOD.</span>
*<span style="font-size:medium">This space is strictly limited to 10 GB per user, but it is backed up by NCI.</span>
*<span style="font-size:medium">It is most suitable for storing source code rather than model outputs or observation datasets. We still encourage you to also use Git to manage your source code and to save it to GitHub.</span>
*<span style="font-size:medium">You can monitor your usage of home with the "quota" command. You may see errors at login when your home directory is full.</span>
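As a sketch, here are two ways to check home-directory usage: the `quota` command mentioned above reports your allocation on NCI systems (it may not exist everywhere, hence the guard), while `du` totals what is actually on disk and works on any system:

```shell
# NCI quota report (skipped quietly if the command is unavailable here):
quota -s 2>/dev/null || true
# Generic fallback: total size of everything under your home directory.
du -sh "$HOME"
```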
=== <span style="font-size:x-large">/scratch</span> ===

{| style="height: 50px;  width: 800px" cellspacing="4" cellpadding="4" border="3" align="center"
|-
| <span style="font-size:medium"><span style="color:#c0392b">Warning: this is a temporary space with [https://opus.nci.org.au/pages/viewpage.action?pageId=156434436 automatic deletion]. All users are responsible for learning about the deletion process.</span></span>
|}

*<span style="font-size:medium">All projects have some storage on /scratch.</span>
*<span style="font-size:medium">It is the recommended filesystem to write your simulation outputs to from the compute nodes.</span>
*<span style="font-size:medium">Inaccessible from OOD.</span>
*<span style="font-size:medium">The amount of space varies from one project to another.</span>
*<span style="font-size:medium">The total space allocation is shared between all project members.</span>
*<span style="font-size:medium">The management of the space is left to the responsibility of the members of each project.</span>
*<span style="font-size:medium">There is [https://opus.nci.org.au/pages/viewpage.action?pageId=156434436 an automated file management system] in place. It automatically removes any file that has not been accessed in the last 100 days.</span>
*<span style="font-size:medium">When a project fills its quota on /scratch, the project's members cannot use the computing queues, except for the copyq queues to help with moving data around.</span>
*<span style="font-size:medium">To monitor the overall usage, please use the <tt>nci_account</tt> command. To monitor the usage per user, please use <tt>nci-files-report -f scratch</tt>.</span>
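Since /scratch removes files not accessed for 100 days, it can help to list what is approaching that limit before the purge does. A minimal sketch using standard GNU find; the `/scratch/$PROJECT/$USER` layout is an assumption, so adjust the path to your project:

```shell
# List files whose access time (atime) is older than 70 days --
# candidates for the 100-day purge. SCRATCH_DIR is an assumed layout.
SCRATCH_DIR="${SCRATCH_DIR:-/scratch/$PROJECT/$USER}"
find "$SCRATCH_DIR" -type f -atime +70 2>/dev/null | sort
```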
  
'''<span style="font-size:medium">Additional storage</span>'''
  
<span style="font-size:medium">Increasing the quota on /scratch for a project might be possible but is left to the decision of NCI staff. An extension can only be requested by the Lead CI for the project. You can check who the Lead CI is on [https://my.nci.org.au/ my.nci.org.au].</span>
  
=== <span style="font-size:x-large">/g/data</span> ===
  
*<span style="font-size:medium">The quota on /g/data is per project, with management of the usage the sole responsibility of the project's members.</span>
*<span style="font-size:medium">Accessible from OOD.</span>
*<span style="font-size:medium">For compute nodes to have read and/or write access to this filesystem, you need to mount the specific area you need with the "-l storage" PBS flag. Remember: do not write the outputs of large climate simulations to /g/data directly. Only write from the compute nodes to this space from, for example, an analysis job.</span>
*<span style="font-size:medium">To monitor usage, please use <tt>nci_account</tt>. To see a per-user summary, run the command <tt>nci-files-report -f gdata</tt>.</span>
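As a sketch, a PBS job header requesting /g/data access might look like the following. The project code ab12, the script name and the resource numbers are placeholders; the "-l storage" syntax is described in NCI's Gadi documentation:

```shell
#!/bin/bash
#PBS -P ab12
#PBS -q normal
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=01:00:00
#PBS -l storage=gdata/ab12+scratch/ab12

# Only the areas listed in -l storage are mounted on the compute node;
# without "gdata/ab12" above, the job could not read /g/data/ab12 at all.
cd "$PBS_O_WORKDIR"
python3 analysis.py   # hypothetical analysis script reading from /g/data
```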
  
<span style="font-size:medium">'''Additional storage'''</span>
  
<span style="font-size:medium">A project's quota on /g/data cannot be increased with a simple request. There is a review of some of the quotas every year, at which point some projects might be granted an increase. If you need additional storage before then, please consider deleting old or incorrect data, archiving old data to massdata, using [http://climate-cms.wikis.unsw.edu.au/Storage#Temporary_storage temporary storage] or your University system. If you still want an increase to be considered at review time, please make sure to discuss it with the Lead CI of your project who will be part of the review.</span>
  
=== <span style="font-size:x-large">Tape mdss</span> ===
  
<span style="font-size:medium">The tape system at NCI is called mdss or massdata. Please read the [[Archiving_data|archiving data wiki page]]&nbsp;to learn how to use this system.</span>
*<span style="font-size:medium">Tape access (writing and reading) is slow.</span>
*<span style="font-size:medium">Inaccessible from OOD.</span>
*<span style="font-size:medium">Tape is mostly appropriate for backing up or archiving data.</span>
*<span style="font-size:medium">You should only store big files on tape. If you want to migrate a lot of small files, you should first archive them together. To learn how to do that, please have a look at the "File compression and archiving" section below and [mailto:climate_help@nci.org.au email] your questions to us.</span>
*<span style="font-size:medium">massdata is only accessible from the login nodes (interactively) or via a script submitted to the copyq queue, and only via specific commands which are detailed in the archiving data wiki page. It is generally recommended to use the copyq queue as you then have a much longer run time.</span>
*<span style="font-size:medium">Considering that data on massdata are likely to be unused for a long time, it is essential to document your data. For example, adding a detailed README file to your data folder can help a lot.</span>
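The copyq recommendation above can be sketched as a small jobscript. This is a sketch only: the project code ab12, the paths and the mdss destination are placeholders, and the mdss subcommands (put, ls, mkdir) are the ones documented on the archiving data wiki page:

```shell
#!/bin/bash
#PBS -P ab12
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=02:00:00
#PBS -l storage=scratch/ab12

cd "$PBS_O_WORKDIR"
# Bundle many small files into one archive first: tape strongly
# prefers a few large files over many small ones.
tar -czf run01_outputs.tar.gz run01/
# Copy the archive to tape (mdss is only available on login/copyq nodes).
mdss -P ab12 mkdir run01
mdss -P ab12 put run01_outputs.tar.gz run01/
mdss -P ab12 ls -l run01
```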
  
== Use cases for the various filesystems ==
=== <span style="font-size:x-large">Keep on scratch only</span> ===

#<span style="font-size:medium">Producing large amounts of temporary data that will be deleted once analysis is completed.</span>
#<span style="font-size:medium">Data duplicated from another site (or from massdata) and used regularly as input for models or analysis, so the access time is constantly updated and the files can be copied again if deleted.</span>
#<span style="font-size:medium">Temporary run directories for a simulation.</span>
#<span style="font-size:medium">PBS log files and other log files from a simulation or analysis.</span>

=== <span style="font-size:x-large">Keep on scratch but migrate when finished (/g/data, /scratch, off-site, tape)</span> ===

#<span style="font-size:medium">Climate model output or other data that will be analysed and archived in a short time frame, within the 100-day expiry time limit, taking into account that accessing the files resets the expiry time.</span>
#<span style="font-size:medium">Figures, in some cases: if you are producing a very large number of figures and only keep a few for the long term.</span>

=== <span style="font-size:x-large">Create initially on scratch but migrate to /g/data automatically and delete from scratch</span> ===

#<span style="font-size:medium">Long climate model runs that might take longer than the 100-day expiry limit to complete, where there is no possibility to analyse the data before the run is complete.</span>
#<span style="font-size:medium">Shared climate model runs that may well be accessed by a wide variety of people, but where the access patterns are not predictable, e.g. CMIP, CORDEX, COSIMA outputs.</span>
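A migration like the above can be sketched with standard tools (the paths in the usage example are placeholders; tools such as rsync could be used the same way):

```shell
# Hypothetical sketch: copy a finished run from /scratch to /g/data,
# then free the scratch space only if the copy succeeded.
migrate_run() {
    src=$1 dest=$2
    mkdir -p "$dest" &&
    cp -a "$src/." "$dest/" &&   # -a preserves times and permissions
    rm -rf "$src"
}
# Example (placeholder paths):
# migrate_run "/scratch/ab12/$USER/run01" "/g/data/ab12/$USER/run01"
```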

=== <span style="font-size:x-large">Create in /home or /g/data directly</span> ===

#<span style="font-size:medium">Any code: your own analysis code, climate model source codes.</span>
#<span style="font-size:medium">External packages installed locally.</span>
#<span style="font-size:medium">DO NOT RUN CLIMATE MODELS AND OUTPUT DIRECTLY TO EITHER OF THESE LOCATIONS. This is an anti-pattern: it is not performant, and NCI will become quite annoyed, as /scratch is designed for this use case and /g/data and /home are not.</span>
  
== Temporary storage ==
 
<span style="font-size:medium">The CMS team also manages two projects at NCI that can be used for temporary storage of data. Both are mounted on /g/data and have the same characteristics as the other /g/data storage spaces explained above.</span>
  
<span style="font-size:medium">The temporary storage projects are:</span>
  
*<span style="font-size:medium">'''/g/data/hh5''': this project is for short temporary use (~3 months). It could be used, for example, to write your raw model outputs; you would then save a subset or a reformatted version to your project's space and move the raw outputs to massdata for safekeeping.</span>
*<span style="font-size:medium">'''/g/data/ua8''': the main purpose of this project is to store replicas of datasets which do not have a specific data project assigned. However, the free space in this project can be used as temporary storage for data that is being processed for publication.</span>
  
=== <span style="font-size:x-large">Request access</span> ===
<span style="font-size:medium">To use any space on these projects, '''you need to''':</span>
  
*<span style="font-size:medium">request [https://my.nci.org.au/mancini/login?next=/mancini/project/ connection to the project] if you are not yet a member. You can check which projects you are part of with the "groups" command.</span>
*<span style="font-size:medium">fill in a storage request using the [https://clex.dmponline.cloud.edu.au/ CLEX DMPonline tool]. If you do not yet have an account on this tool, please be patient while the account is created: to avoid robots and unauthorised access, account creation requires human verification on our end. Please [mailto:cws_help@nci.org.au email us] if you have any questions about filling in the form. Note this form is principally to enable us to monitor the space used, requested and available. It also enables us to prepare a folder for you with appropriate permissions. The forms are very short and quick to fill in, and the storage is usually ready for use within a few hours. See [[Storage-request|this page]] for more detailed instructions on how to fill in the form. Note:</span>
**<span style="font-size:medium">you do not need to email us, please just fill in the form to request the allocation you would like.</span>
**<span style="font-size:medium">also, you can request space for use by a whole group instead of per user, but all users of the group '''must''' request connection to the project.</span>
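The membership check mentioned above is just a standard Unix groups lookup, since NCI project memberships appear as groups. As a sketch:

```shell
# List the unix groups (NCI projects) your account belongs to:
groups
# Equivalent lookup for a named user (falls back to the current user):
id -nG "${USER:-$(id -un)}"
```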
  
 
= Storage at Universities =
 
<span style="font-size:medium"><span style="line-height: 1.5">at ANU</span></span>
  
<span style="font-size:medium">at Monash</span>
  
<span style="font-size:medium">at UMelb</span>
  
<span style="font-size:medium">at UNSW</span>
  
<span style="font-size:medium">at UTAS</span>
  
= File compression and archiving =
 
<span style="font-size:medium">See our [[Archiving_data|Archiving Data]] page for more details. Below are a few rules and pointers which are important to know before starting any work.</span>
  
<span style="font-size:medium">For an efficient use of storage, there are a few rules to keep in mind:</span>
  
*<span style="font-size:medium">it is more efficient to store a few larger files than lots of smaller files. It is hard to define large and small, but files of several tens of gigabytes are absolutely acceptable. The size of the files should clearly also take into account how you or others are going to use them. Files nearing 100 GB become unmanageable and should be produced only if there is no other option.</span>
*<span style="font-size:medium">It is always best practice to compress your data when possible.&nbsp;Netcdf files are now easily compressible, see [[NetCDF_Compression_Tools|this wiki page]]&nbsp;for detailed explanations on tools available at NCI.</span>
  
<span style="font-size:medium">To store small setup files that define your experiments, think about using the "tar" command. Here is [https://docs.google.com/document/d/1do9dtyzTQ5VY3yFC6YEj-fhAW7OnkA5jFzol1vjRu4w/edit?usp=sharing a cheat sheet] for tar. This is a shell command with a manual accessible through</span>
<syntaxhighlight lang="bash">
man tar
</syntaxhighlight>
<span style="font-size:medium">This command will save many files together in a single archive, it can be used on a directory tree and will restore the directory structure when restoring the files from the archive. This means if you have several experiments you need to save the setup of, the best way might be to create a directory tree containing the setup files of all the experiments then create one single archive file for all. The archive files can also easily be compressed/uncompressed using the gzip utility either at the archive creation time or afterwards.&nbsp;</span>
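As a worked sketch of the workflow above (the directory and file names are made up for illustration):

```shell
# Gather the setup files of several experiments under one tree...
mkdir -p experiments/run01 experiments/run02
echo "dt = 300" > experiments/run01/namelist
echo "dt = 600" > experiments/run02/namelist
# ...then create a single gzip-compressed archive of the whole tree.
tar -czf experiments.tar.gz experiments/
tar -tzf experiments.tar.gz          # list the archive contents
# Extracting later restores the directory structure exactly as it was:
# tar -xzf experiments.tar.gz
```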
[[Category:Data induction]]
Latest revision as of 23:32, 26 April 2022

During your work at the Centre, you are likely to produce, use and share data on different systems. You will probably have access to two different systems: your University system and NCI.

 

Storage on Gadi

On Gadi, NCI has four different filesystems: $HOME, /scratch, /g/data and mdss. mdss is a tape-based filesystem while $HOME, /scratch and /g/data are disk-based.

There are also specific filesystems linked to the Open On Demand system which are less used. On this page, we will only talk about the storage available from Gadi.


Hierarchy of filesystems

The different filesystems are best thought of as a hierarchy, each with a specific usecase:

  • $HOME: this is a small disk space that is backed up by NCI. It is best suited for important, hard to reproduce files, e.g. data analysis codes.
  • /scratch: this is a temporary disk space. Files are automatically deleted on some condition. It is best suited for raw output from climate models.
  • /g/data: this is a permanent disk space. It is best suited for data that is used over a long time, e.g. climate models inputs, climate models outputs that is being analysed, climate models source code that needs to be compiled.
  • mdss: this is a tape space. It is best suited as a backup of important files or an archiving system.

This means you need to implement a data management workflow that is compatible with the filesystems specifications, your scientific project and the management of your NCI's project. We have a blog post that can help with identifying the right questions and strategies for your data.

$HOME

  • This is your home directory on Gadi.
  • Unaccessible from OOD
  • This space is strictly limited at 10GB for each user but it is backed up by NCI.
  • It is most suitable for storing source code rather than model outputs or observation datasets. We still encourage you to also use Git to manage your source code and to save it to Github.
  • You can monitor your use on home with the "quota" command. You may have errors at log in when your home directory is full.


/scratch

Warning: this is a temporary space with automatic deletion. All users are responsible to learn about the deletion process.
  • All projects have some storage on /scratch.
  • It is the recommended filesystem to write outputs from your simulations into from the compute nodes.
  • Unaccessible from OOD
  • The amount of space varies from a project to another.
  • The total space allocation is shared between all project members.
  • The management of the space is left to the responsibility of the members of each project.
  • There is an automated file management system in place. It will automatically remove any file that has not been accessed in the last 100 days.
  • When a project fills its quota on /scratch, the project's members will not be able to use the computing queues except for copyq queues to help with moving data around.
  • To monitor the overall usage, please use the nci_account command. To monitor the usage per user, please use nci-files-report -f scratch.

Additional storage

Increasing the quota on /scratch for a project might be possible but is left to the decision of NCI staff. An extension can only be requested by the Lead CI for the project. You can check who the Lead CI is on my.nci.org.au.

 


/g/data

  • The quota on /g/data is per project with management of the usage the sole responsibility of the project's members.
  • Accessible from OOD
  • For compute nodes to have read and/or write access to this filesystem, you need to mount the specific area you need with the "-l storage" PBS flag. Remember do not write the outputs of large climate simulations to /g/data directly. Only write from the compute nodes to this space from an analysis job for example.
  • To monitor usage, please use nci_account. To see a per-user summary run the command nci-files-report -f gdata

Additional storage

A project's quota on /g/data cannot be increased with a simple request. There is a review of some of the quotas every year, at which point some projects might be granted an increase. If you need additional storage before then, please consider deleting old or incorrect data, archiving old data to massdata, using temporary storage or your University system. If you still want an increase to be considered at review time, please make sure to discuss it with the Lead CI of your project who will be part of the review.

Tape mdss

The tape system at NCI is called mdss or massdata. Please read the archiving data wiki page to learn how to use this system.

  • Tape access (writing and reading) is slow.
  • Inaccessible from OOD.
  • Tape is mostly appropriate for backing up or archiving data.
  • You should only store big files on tape. If you want to migrate a lot of small files, you should first archive them together into a single file. To learn how to do that, please have a look at the "File compression and archiving" section below and email us your questions.
  • massdata is only accessible from the login nodes (interactively) or via a script submitted to the copyq queue. It is only accessible via specific commands, which are detailed in the archiving data wiki page. It is generally recommended to use the copyq queue, as it allows a much longer walltime.
  • Since data on massdata is likely to sit unused for a long time, it is essential to document it. For example, adding a detailed README file to your data folder can help a lot.
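To make the workflow concrete, below is a sketch of a copyq job script that bundles many small files into one archive and copies it to massdata. The project code "ab1", the directory names, and the exact mdss invocations are assumptions for illustration; check the archiving data wiki page for the commands supported on your system before relying on this.

```shell
#!/bin/bash
# Sketch only: "ab1" is a placeholder project code and the paths are
# made up. mdss commands are available on login nodes and the copyq.
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=02:00:00
#PBS -l storage=scratch/ab1

cd /scratch/ab1/$USER

# Bundle the many small setup/output files into one compressed archive,
# since tape handles a few big files far better than many small ones.
tar -czf experiment1.tar.gz experiment1/

# Create a destination directory on tape, then copy the archive over.
mdss -P ab1 mkdir archives
mdss -P ab1 put experiment1.tar.gz archives/

# Check the file arrived before deleting anything locally.
mdss -P ab1 ls -l archives/
```

Running this through the copyq rather than interactively gives you the longer walltime that slow tape transfers often need.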

Use cases for the various filesystems

Keep on scratch only

  1. Producing large amounts of temporary data that will be deleted once analysis is completed.
  2. Data duplicated from another site (or from massdata) that is used regularly as input for models or analysis: its access time is constantly updated, and it can be copied again if deleted.
  3. Temporary run directory for a simulation.
  4. PBS log files and other log files from a simulation or analysis.

Keep on scratch but migrate when finished (/g/data, /scratch, off-site, tape)

  1. Climate model output or other data that will be analysed and archived in a short time frame, within the 100-day expiry limit, taking into account that accessing the files resets the expiry time.
  2. Figures, in some cases: for example, if you produce a very large number of figures but only keep a few for the long term.

Create initially on scratch but migrate to /g/data automatically and delete from scratch

  1. Long climate model runs that might take longer than the 100-day expiry period to complete, where there is no possibility to analyse the data before the run is complete.
  2. Shared climate model runs that may be accessed by a wide variety of people with unpredictable access patterns, e.g. CMIP, CORDEX, COSIMA outputs.

Create in /home or /g/data directly

  1. Any code: your own analysis code, climate model source codes.
  2. External packages installed locally.
  3. DO NOT RUN CLIMATE MODELS AND WRITE THEIR OUTPUT DIRECTLY TO EITHER OF THESE LOCATIONS. This is an anti-pattern: it is not performant, and NCI will become quite annoyed, as /scratch is designed for this use case while /g/data and /home are not.


Temporary storage

The CMS also manages two projects at NCI that can be used for temporary storage of data. Both are mounted on /g/data and have the same characteristics as the other /g/data storage spaces explained above.

The temporary storage projects are:

  • /g/data/hh5: this project is for short temporary use (~3 months). It could be used, for example, to write your raw model outputs; you would then save a subset or a reformatted version to your project's space and move the raw outputs to massdata for safekeeping.
  • /g/data/ua8: the main purpose of this project is to store replicas of datasets which do not have a specific data project assigned. However, the free space in this project can be used as temporary storage for data that is being processed for publication.

Request access

To use any space on these projects, you need to:

  • request connection to the project if you are not yet a member. You can check which projects you are part of with the "groups" command.
  • fill in a storage request using the CLEX DMPonline tool. If you do not yet have an account on this tool, please be patient while the account is created: to avoid robots and unauthorised access, account creation requires human verification on our end. Please email us if you have any questions about filling in the form. This form is principally to enable us to monitor the space used, requested and available; it also enables us to prepare a folder for you with appropriate permissions. The forms are very short and quick to fill in, and the storage is usually ready for use within a few hours. See this page for more detailed instructions on how to fill in the form. Note:
    • you do not need to email us; just fill in the form to request the allocation you would like.
    • you can request space for use by a whole group instead of per user, but all users of the group must request connection to the project.

Storage at Universities

at ANU

at Monash

at UMelb

at UNSW

at UTAS


File compression and archiving

See our Archiving Data page for more details. Below are a few rules and pointers which are important to know before starting any work.

For an efficient use of storage, there are a few rules to keep in mind:

  • It is more efficient to store a few larger files than lots of smaller files. It is hard to define "large" and "small", but files of several tens of gigabytes are absolutely acceptable. The size of the files should also take into account how you or others are going to use them. Files nearing 100 GB become unmanageable and should be produced only if there is no other option.
  • It is always best practice to compress your data when possible. NetCDF files are now easily compressible; see this wiki page for detailed explanations of the tools available at NCI.

To store small setup files that define your experiments, think about using the "tar" command. Here is a cheat sheet for tar. This is a shell command with a manual accessible through

man tar

This command saves many files together in a single archive. It can be used on a directory tree and will restore the directory structure when the files are extracted from the archive. This means that if you have several experiments whose setup you need to save, the best way might be to create a directory tree containing the setup files of all the experiments, then create one single archive file for all of them. Archive files can also easily be compressed and uncompressed using the gzip utility, either at archive creation time or afterwards.
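The workflow above can be sketched as follows. The experiment directories and namelist files are made up for illustration; only the tar and gzip usage itself is the point.

```shell
# Build a small directory tree of setup files (names are hypothetical).
mkdir -p experiments/exp1 experiments/exp2
echo "dt = 300" > experiments/exp1/namelist
echo "dt = 600" > experiments/exp2/namelist

# Create a gzip-compressed archive of the whole tree in one step
# ("c" create, "z" gzip compression, "f" archive file name).
tar -czf experiments.tar.gz experiments/

# List the archive contents without extracting anything.
tar -tzf experiments.tar.gz

# Extract into another location: the directory structure is restored.
mkdir -p restore
tar -xzf experiments.tar.gz -C restore
cat restore/experiments/exp1/namelist   # -> dt = 300
```

One compressed archive like this is exactly the kind of single, larger file that migrates well to massdata, instead of many tiny setup files.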