Difference between revisions of "How to use MDSS tape storage at NCI"

 
(3 intermediate revisions by 3 users not shown)
Line 1: Line 1:
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">All universities have their own archives where you can store your data, for more information contact your Library or IT department. Here we will focus on archiving data using the NCI archive storage, for all cases where using a university archive is not applicable.</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Massdata (Mass Data Storage System, MDSS for short) is the tape storage available at NCI. This kind of storage is intended for long term storage of large files. It is possible to retrieve data from MDSS, so this is a good place to store data that will not be required for some time, for&nbsp;backup,&nbsp;or for&nbsp;archiving. Each project has a directory on the MDSS, the amount of storage allocated depends on the project allocation.</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Massdata (Mass Data Storage System, MDSS for short) is the tape storage available at NCI. This kind of storage is intended for long term archiving of large files. Each project has a directory on the MDSS, the amount of storage allocated depends on the project allocation and can be checked using the [[Accounting_at_NCI|nci_account command.]]</span></span>
+
= <span style="font-family:Arial,Helvetica,sans-serif">MDSS proper usage</span> =
  
=== <span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">'''MDSS proper usage'''</span></span> ===
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">MDSS is designed for medium to long-term storage of large files, this means it is optimised for storing big amounts&nbsp;of data. This means it is most suitable for:</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">MDSS is designed for medium to long-term archive of large files, this means it is capable and optimise for of storing big amounts&nbsp;of data, but not to retrieve this files often. This means it is most suitable for:</span></span>
+
*<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Files you are required to keep for a long term, like data underlying published datasets, publications, PhD theses etc.</span></span>
 +
*<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Files that&nbsp;you or someone else are likely to reuse or analyse again in the future but not in the next few months. For example, restart files or other model output you are not immediately using should be moved from disk to massdata&nbsp;as soon as possible.</span></span>
 +
*<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">MDSS is suitable for&nbsp;backup of big data projects, like model output which could not be backed up elsewhere.</span></span>  
  
*<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Files you are required to keep for a long term, like data underlining published datasets, publications, PhD thesis etc.</span></span>
+
= <span style="font-family:Arial,Helvetica,sans-serif">Preparing your data for mdss</span> =
*<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Files that&nbsp;you or someone else are likely to reuse or analyse again in the future but not in the next few months. For example restart files or other model output you are not immediately using should be moved from disk to massdata&nbsp;as soon as possible.</span></span>  
 
*<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">MDSS is suitable for&nbsp;backup of big data projects, like model output which could not be backed up elsewhere.</span></span>
 
  
&nbsp;
+
#<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Organise your files and delete anything which you will not&nbsp;be re-using. Do not transfer data before organising it. It is difficult to get a&nbsp;list&nbsp;of what is stored on massdata,&nbsp;let alone to list what is in a tarred file once it is uploaded.&nbsp;</span></span>
 +
#<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Big files: use tools like <tt>tar</tt> to bundle files together into archive files. Create reasonably big archive files but also think of how you might want to access the data later. There is no point in tarring together two different simulations if you would want to access them separately, as then you would need to transfer back a large amount of data you do not need.&nbsp;Your upload will fail if any of your files are less than 20 MB or the average size is less than 250 MB.</span></span>
 +
#<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Files should be group readable, with group execute permissions for directories. This helps with long term maintenance, allowing administrators to track the type and size of archived data. You can change the permissions on data you own with the <tt>chmod</tt> unix utility.</span></span>
  
=== '''<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Preparing your data for mdss</span></span>''' ===
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">While you are preparing your data to be moved it is an opportunity to also document it, if you have not done so already. You should document what you are archiving and how you are archiving it. Even a simple readme file added to your main directory can help others and your future self. If you are archiving data underlying a publication or published dataset then it is important to have a summary of what is stored in /massdata. This is part of the [[Data_Management_Plan|dataset management plan]] and/or [[Data_Availability_Statement|data availability statement]].</span></span>
  
#<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Organise your files and delete anything which you will not&nbsp;be re-using. Do not transfer data before organising it. It is difficult get a&nbsp;list&nbsp;of what is stored on massdata,&nbsp;let alone to list what is in a tarred file once it is uploaded.&nbsp;</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Useful tools:</span></span>
#<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Big files: use tools like tar to bundle files together into archive files. Create reasonably big archive files but also think of how you might want to access the data later.&nbsp;No point of tarring together two different simulations if you would want to access them separately, as then you would need to transferred back a big amount of data you do not need.&nbsp;Your upload will fail if any of your files are less than 20MB or the average size is less than 250 MB.</span></span>
 
#<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Files should be group readable, with group execute permissions for directories. This helps with long term maintenance, allowing administrators to track the type and size of archived data.</span></span>
 
#<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">In the past you could use nci_account to monitor the allocation and how much of it was still available. Currently this is not possible anymore so, particularly if you want to move a big amount of data to massdata, you should first check with the lead CI of the project you want to use, to make sure enough storage is available.</span></span>  
 
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">While you are preparing your data to be moved it is an opportunity to also document, if you have not done so already, what you are archiving and how. Even a simple readme file added to your main directory can help others and your future self. If you are archiving data underlying a publication or published dataset then it is important to have a summary of what is stored in /massdata and how is part of the [[Data_Management_Plan|dataset management plan]] and/or [[Data_Availability_Statement|data availiability statement]].</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[[TAR_guidelines|TAR- to create archives]] cheatsheet</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Useful tools:</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[[NetCDF_Compression_Tools|Compressing tools]]</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[[TAR_guidelines|TAR- to create archives]] cheatsheet</span></span>
+
= <span style="font-family:Arial,Helvetica,sans-serif">Accessing MDSS</span> =
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[[NetCDF_Compression_Tools|Compressing tools]]</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Massdata cannot be accessed directly via a directory path. All access of MDSS is via the command '''<tt>mdss</tt>.'''</span></span>
  
=== '''<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Accessing MDSS</span></span>''' ===
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Users connected to the project have read, write and execute permissions in the corresponding&nbsp;directory on mdss and so may create their own files in it.</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Massdata cannot be accessed directly via a directory path. All access of MDSS is via the command '''mdss.'''</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif"><tt>mdss</tt> has several sub-commands and options to see all of them use either:</span></span>
 
+
<syntaxhighlight lang="bash"> $mdss --help
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Users connected to the project have rwx permissions in the corresponding&nbsp;directory and so may create their own files in it.</span></span>
+
or
 
+
  $man mdss
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Mdss has several sub-commands and options to see all of them:</span></span>
 
<syntaxhighlight lang="bash">
 
mdss --help or
 
man mdss;
 
 
</syntaxhighlight>
 
</syntaxhighlight>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Usually you specify the project, if you don't it will use your default project, and then add a sub-command and the path of the files and directories you want to upload, list etc.</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">If you don't specify a project, it will use your default project. Then you add a sub-command and the path of the files and directories you want to upload, list etc.</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">mdss -P <project-id>;+ <sub-command> + <path></span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">mdss -P <project-id> + <sub-command> + <path></span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Most useful sub-commands are:</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Most useful sub-commands are:</span></span>
<syntaxhighlight lang="bash">
+
<syntaxhighlight lang="bash">mdss put  - upload files  
mdss put  - upload files  
 
 
mdss get  - retrieve files  
 
mdss get  - retrieve files  
 
mdss ls    - list directories and files  
 
mdss ls    - list directories and files  
Line 53: Line 47:
 
mdss dmls  - show what is on cache and what is on tape
 
mdss dmls  - show what is on cache and what is on tape
  
NB "mdss du" will also work but only return the size of what is still cached, dmdu will give the full size of what is on tape regardless that is cached or not.
+
NB "mdss du" will also work but only return the size of what is still cached, dmdu will give the full size of what is on tape regardless if it is cached or not.
  </syntaxhighlight>
+
</syntaxhighlight>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Please note mdss commands work only interactively or with ‘copyq’</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Please note <tt>mdss</tt> sub-commands work only interactively or on the <tt>copyq </tt>queue. To use it on <tt>copyq</tt> remember to set the storage flag as</span></span>
 +
<pre>-l storage=massdata/<project_code></pre>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">'''Monitoring MDSS&nbsp;usage'''</span></span>
+
= <span style="font-family:Arial,Helvetica,sans-serif">Monitoring MDSS&nbsp;usage</span> =
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Unfortunately, there is also not a command to check quickly usage by user-id as for /g/data and /short. The only way to get this information currently is to ask help<at>nci.org.au, administrators can access this information for any CI of the group.</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Unfortunately, there is no command to check the usage by user-id as for /g/data and /scratch. The only way to get this information currently is to ask help<at>nci.org.au. The NCI administrators can access this information for any CI of the group.</span></span>
  
=== <span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">'''Transferring data to and from MDSS'''</span></span> ===
+
= <span style="font-family:Arial,Helvetica,sans-serif">Transferring data to and from MDSS</span> =
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">NCI also supports the netmv and netcp commands to work with MDSS. These commands create a copyq job to transfer multiple files. Files can be automatically tarred and compressed as part of the copy process.</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">NCI also supports the <tt>netmv</tt> and <tt>netcp</tt> commands to work with MDSS. These commands create a copyq job to transfer multiple files. Files can be automatically tarred and compressed as part of the copy process.</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">'''NOTE: The process of compressing can use a lot of storage on /short if you're moving lots of data!'''</span></span>
+
<span style="color:#c0392b"><span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">'''Warning: The automatic archiving and compression of these tools can use a lot of storage on /scratch if you're moving lots of data!'''</span></span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">For more info run `man netmv`</span></span><span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">or check their [https://opus.nci.org.au/display/Help/Gadi+User+Guide User Guide].</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">For more info run `man netmv`</span></span><span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">.</span></span>
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">The CMS team has also developed a utility called [https://github.com/coecms/mdssdiff mdssdiff] available from our conda environments. This utility allows users to compare the contents of the local directory and a directory under /massdata. It will also recursively update the content on the massdata directory to copy the local directory or vice versa.</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">The CMS team has also developed a utility called [https://github.com/coecms/mdssdiff mdssdiff] available from [[Conda|our conda environments]]. This utility allows users to compare the contents of the local directory and a directory under /massdata. It will also recursively update the content on the massdata directory to copy the local directory or vice versa.</span></span>
  
=== <span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">'''Modifications to MDSS datasets'''</span></span> ===
+
= <span style="font-family:Arial,Helvetica,sans-serif">Modifications to MDSS datasets</span> =
  
<span style="font-size:medium;"><span style="font-family:Arial,Helvetica,sans-serif;">Contact NCI at help<at>nci.org.au if large metadata operations are needed on massdata, as changing ownership, project code, permissions etc. of existing datasets</span></span>
+
<span style="font-size:medium"><span style="font-family:Arial,Helvetica,sans-serif">Ask the Lead CI of the project to contact NCI at help@nci.org.au if large metadata operations are needed on massdata, such as changing ownership, project code, permissions etc. of existing datasets</span></span>
  
 
[[Category:NCI Guidelines]] [[Category:Data induction]]
 
[[Category:NCI Guidelines]] [[Category:Data induction]]

Latest revision as of 21:43, 2 May 2022

Massdata (Mass Data Storage System, MDSS for short) is the tape storage available at NCI. This kind of storage is intended for long-term storage of large files. It is possible to retrieve data from MDSS, so this is a good place to store data that will not be required for some time, for backup, or for archiving. Each project has a directory on the MDSS; the amount of storage available depends on the project allocation.

MDSS proper usage

MDSS is designed for medium to long-term storage of large files, so it is optimised for storing large amounts of data. It is most suitable for:

  • Files you are required to keep for a long term, like data underlying published datasets, publications, PhD theses etc.
  • Files that you or someone else are likely to reuse or analyse again in the future but not in the next few months. For example, restart files or other model output you are not immediately using should be moved from disk to massdata as soon as possible.
  • Backups of big data projects, like model output which could not be backed up elsewhere.

Preparing your data for mdss

  1. Organise your files and delete anything which you will not be re-using. Do not transfer data before organising it. It is difficult to get a list of what is stored on massdata, let alone to list what is in a tarred file once it is uploaded.
  2. Big files: use tools like tar to bundle files together into archive files. Create reasonably big archive files but also think of how you might want to access the data later. There is no point in tarring together two different simulations if you would want to access them separately, as then you would need to transfer back a large amount of data you do not need. Your upload will fail if any of your files are less than 20 MB or the average size is less than 250 MB.
  3. Files should be group readable, with group execute permissions for directories. This helps with long-term maintenance, allowing administrators to track the type and size of archived data. You can change the permissions on data you own with the chmod unix utility.
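
Steps 2 and 3 can be sketched with standard tools as follows. This is only a minimal illustration, not an NCI-endorsed procedure: the directory names run_a and run_b and the temporary work directory are hypothetical placeholders, and the tiny files created here stand in for real model output (real archives would have to meet the size limits mentioned in step 2).

```shell
#!/usr/bin/env bash
set -e

# Hypothetical layout: one sub-directory per simulation under a work directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/run_a/output" "$workdir/run_b/output"
echo "restart data a" > "$workdir/run_a/output/restart.nc"
echo "restart data b" > "$workdir/run_b/output/restart.nc"

# Bundle each simulation into its own archive so it can be retrieved separately.
for run in run_a run_b; do
    tar -cf "$workdir/$run.tar" -C "$workdir" "$run"
done

# Make the archives group readable, and directories group readable and searchable.
chmod g+r "$workdir"/*.tar
find "$workdir" -type d -exec chmod g+rx {} +

# Print each archive size in bytes so it can be compared with the upload limits.
for archive in "$workdir"/*.tar; do
    printf '%s\t%s bytes\n' "$(basename "$archive")" "$(wc -c < "$archive")"
done
```

Keeping one archive per simulation means a later retrieval only pulls back the data that is actually needed.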

While you are preparing your data to be moved it is an opportunity to also document it, if you have not done so already. You should document what you are archiving and how you are archiving it. Even a simple readme file added to your main directory can help others and your future self. If you are archiving data underlying a publication or published dataset then it is important to have a summary of what is stored in /massdata. This is part of the dataset management plan and/or data availability statement.
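
Because it is hard to list the contents of a tarred file once it is on massdata, one possible approach is to record the listing in a readme before uploading. The sketch below assumes a hypothetical archive called run_a.tar in a temporary directory; the readme name and its fields are illustrative choices, not a required format.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical example data: a small directory bundled into run_a.tar.
archive_dir=$(mktemp -d)
mkdir -p "$archive_dir/run_a"
echo "example model output" > "$archive_dir/run_a/output.nc"
tar -cf "$archive_dir/run_a.tar" -C "$archive_dir" run_a

# Write a simple readme that records what the archive contains.
{
    echo "Archive: run_a.tar"
    echo "Created: $(date +%Y-%m-%d)"
    echo "Contents:"
    tar -tf "$archive_dir/run_a.tar"
} > "$archive_dir/run_a.README"

cat "$archive_dir/run_a.README"
```

The readme can be kept on disk and uploaded alongside the archive, so the contents can be checked without retrieving the tar file.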

Useful tools:

         TAR- to create archives cheatsheet

         Compressing tools

Accessing MDSS

Massdata cannot be accessed directly via a directory path. All access to MDSS is via the mdss command.

Users connected to the project have read, write and execute permissions in the corresponding directory on mdss and so may create their own files in it.

mdss has several sub-commands and options. To see all of them use either:

  $ mdss --help
or
  $ man mdss

If you don't specify a project, mdss will use your default project. You then add a sub-command and the path of the files and directories you want to upload, list, etc.

mdss -P <project-id> <sub-command> <path>

Most useful sub-commands are:

mdss put   - upload files 
mdss get   - retrieve files 
mdss ls    - list directories and files 
mdss dmdu  - get the size of a directory/file 
mdss dmls  - show what is on cache and what is on tape

NB: "mdss du" also works, but it only returns the size of what is still cached; "mdss dmdu" gives the full size of what is on tape, regardless of whether it is cached or not.
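
A typical session might look like the sketch below. The project code ab12, the archive name and the experiments directory are hypothetical, and the two-argument form of put and get (source, then destination) is an assumption to confirm with mdss --help. The small run helper is not part of mdss: it only prints each command while DRY_RUN is 1, so the sequence can be previewed on a machine where mdss is not installed.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical project code; replace with your own.
project=ab12
# Set DRY_RUN=0 to execute the commands instead of printing them.
DRY_RUN=${DRY_RUN:-1}

# Helper (not part of mdss): print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run mdss -P "$project" put run_a.tar experiments/run_a.tar   # upload an archive
run mdss -P "$project" ls experiments                         # list the directory
run mdss -P "$project" dmls experiments                       # cache or tape status
run mdss -P "$project" dmdu experiments                       # full size on tape
run mdss -P "$project" get experiments/run_a.tar .            # retrieve the archive
```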

Please note mdss sub-commands work only interactively or on the copyq queue. To use them in a copyq job, remember to set the storage flag as:

-l storage=massdata/<project_code>
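
As a rough sketch, a copyq job script using this flag could be generated as shown below. The project code ab12, the file names and the resource requests (ncpus, mem, walltime) are hypothetical values chosen for illustration, and any other filesystem the job reads from may also need to be listed in the storage flag. Check the NCI documentation for the limits that apply to copyq.

```shell
#!/usr/bin/env bash
set -e

# Write the job script to a temporary location (hypothetical file name).
jobdir=$(mktemp -d)
jobfile="$jobdir/archive_job.sh"

cat > "$jobfile" <<'EOF'
#!/bin/bash
#PBS -P ab12
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=2GB
#PBS -l walltime=02:00:00
#PBS -l storage=massdata/ab12
#PBS -l wd

# Upload one archive from the submission directory to the project's MDSS directory.
mdss -P ab12 put run_a.tar experiments/run_a.tar
EOF

echo "Job script written to $jobfile; submit it with: qsub $jobfile"
```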

Monitoring MDSS usage

Unfortunately, there is no command to check usage by user-id, as there is for /g/data and /scratch. The only way to get this information currently is to ask help@nci.org.au. The NCI administrators can access this information for any CI of the group.

Transferring data to and from MDSS

NCI also supports the netmv and netcp commands to work with MDSS. These commands create a copyq job to transfer multiple files. Files can be automatically tarred and compressed as part of the copy process.

Warning: The automatic archiving and compression of these tools can use a lot of storage on /scratch if you're moving lots of data!

For more info run `man netmv`.

The CMS team has also developed a utility called mdssdiff, available from our conda environments. This utility allows users to compare the contents of a local directory and a directory under /massdata. It can also recursively update the content of the massdata directory to match the local directory, or vice versa.

Modifications to MDSS datasets

Ask the Lead CI of the project to contact NCI at help@nci.org.au if large metadata operations are needed on massdata, such as changing ownership, project code, permissions etc. of existing datasets.