Difference between revisions of "Archiving data"


Revision as of 22:29, 15 July 2021

Massdata (Mass Data Storage System, MDSS for short) is the tape storage available at NCI. This kind of storage is intended for long-term archiving of large files. Each project has a directory on MDSS; the amount of storage allocated depends on the project allocation and can be checked using the nci_account command.

MDSS proper usage

MDSS is designed for medium- to long-term archiving of large files, so it is suitable for:

  • Files you are required to keep, for example model outputs or configurations from published datasets, publications, PhD theses, etc.
  • Files that you or someone else are likely to reuse or analyse again in the future, but not in the next few months. For example, restart files or other model output you are not immediately using should be moved from disk to MDSS as soon as possible.
  • Backups of big data projects, such as model output that cannot be backed up elsewhere. MDSS is not suitable for small amounts of data, where other backup options would be easier and more efficient. It is also not suitable for code files you might want to keep; for these, online services such as GitHub or Bitbucket should be your preferred choice.

Guidelines for storage

  • Big files: if your files are small (less than 20 MB), use a tool like tar to bundle them into a single archive file.
  • Files should be group readable, with group execute permissions for directories. This helps with long-term maintenance, allowing administrators to track the type and size of archived data.
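
The permissions guideline above could be applied locally before upload along these lines. This is only a sketch: the function name `make_group_readable` is made up for illustration, and it assumes that the permissions set on disk are the ones you want on the archived copy, which is worth verifying after upload.

```shell
# Make a directory tree group readable before archiving it.
# The capital X adds group execute only to directories (and to files
# that are already executable), so plain data files stay non-executable.
make_group_readable() {
    chmod -R g+rX "$1"
}

# Example (hypothetical path):
#   make_group_readable /path/to/experiment
```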

 

Accessing MDSS

Massdata cannot be accessed directly via a directory path. All access to MDSS is through the mdss command.

Users connected to the project have rwx permissions in the corresponding directory and so may create their own files in it.

The mdss command has several sub-commands and options. To see all of them, run:

mdss --help

or

man mdss

Usually you specify the project (if you don't, mdss uses your default project), then add a sub-command and the path of the files and directories you want to upload, list, etc.

mdss -P <project-id> <sub-command> <path>

The most useful sub-commands are:

mdss put   - upload files 
mdss get   - retrieve files 
mdss ls    - list directories and files 
mdss dmdu  - get the size of a directory/file 
mdss dmls  - show what is on cache and what is on tape

NB "mdss du" will also work, but it only returns the size of what is still cached; dmdu gives the full size of what is on tape, regardless of whether it is cached or not.
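
As an illustration of how these sub-commands fit together, the sketch below uploads one archive and then checks it. The project code `ab12`, the file and directory names, and the helper names `run_mdss` and `archive_and_check` are all hypothetical, and the argument order for `put` (local file first, then destination) is an assumption to confirm with `mdss --help`. By default the helper only prints the commands, so it can be tried without touching MDSS.

```shell
# Hypothetical project code; replace with your own.
PROJECT=${PROJECT:-ab12}

# Print the mdss command by default; set MDSS_DRY_RUN=0 to really run it.
run_mdss() {
    if [ "${MDSS_DRY_RUN:-1}" = "1" ]; then
        echo "mdss -P $PROJECT $*"
    else
        mdss -P "$PROJECT" "$@"
    fi
}

# Upload one archive, then list it and check its state and size.
archive_and_check() {
    tarfile=$1
    remote_dir=$2
    run_mdss put "$tarfile" "$remote_dir/"   # upload (argument order assumed)
    run_mdss ls "$remote_dir"                # confirm the file is listed
    run_mdss dmls "$remote_dir"              # cache versus tape state
    run_mdss dmdu "$remote_dir"              # full size, cached or not
}

# Example: archive_and_check run1.tar.gz experiments
```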

  

Please note that mdss commands work only in interactive sessions or in jobs submitted to the ‘copyq’ queue.
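
For non-interactive use, a copyq job could look roughly like the script generated below. This is a sketch under assumptions: the helper name `write_copyq_job`, the project code, file names, walltime and memory values are placeholders, and the `put` argument order is assumed. Your system may require additional PBS directives, so check the NCI user guide before submitting.

```shell
# Write a minimal PBS job script that uploads one archive via the copyq queue.
# Arguments: project code, local archive, destination on MDSS, job file name.
write_copyq_job() {
    project=$1
    tarfile=$2
    remote_dir=$3
    jobfile=$4
    cat > "$jobfile" <<EOF
#!/bin/bash
#PBS -q copyq
#PBS -P $project
#PBS -l walltime=02:00:00
#PBS -l mem=2GB
#PBS -l ncpus=1
# Run from the directory the job was submitted from
cd "\$PBS_O_WORKDIR"
mdss -P $project put $tarfile $remote_dir
EOF
}

# Example (hypothetical names), followed by submission with qsub:
#   write_copyq_job ab12 run1.tar.gz archive/ archive_job.sh
#   qsub archive_job.sh
```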

Preparing your data for mdss

  1. Organise your files and delete anything you will not be re-using. It is tempting to copy entire directories as they are, thinking you will get back to them later. However, there is currently no easy way to list what you are storing on massdata, so trying to tidy up after you have uploaded your files would be slow and painful. Even more than with other storage options, it is important to upload only suitable files and to make sure that they have been compressed and tarred together if necessary.
  2. NCI guidelines suggest a minimum size of 20MB per file and an average size of 250MB.
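
The two steps above can be sketched with standard tools: first find files below the 20 MB guideline, then bundle a directory into one compressed archive. The function names and paths are illustrative only. The size test uses 512-byte blocks, the unit `find` uses by default, so 40960 blocks corresponds to 20 MiB.

```shell
# List regular files under a directory that are smaller than about 20 MB.
# -size counts 512-byte blocks: 40960 * 512 bytes = 20 MiB.
list_small_files() {
    find "$1" -type f -size -40960
}

# Bundle a whole directory into a single gzip-compressed tar archive.
# -C switches to the parent directory so the archive holds relative paths.
bundle_dir() {
    src=$1
    out=$2
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example (hypothetical paths):
#   list_small_files /path/to/run1
#   bundle_dir /path/to/run1 /path/to/run1.tar.gz
```

Listing the archive afterwards with `tar -tzf` is a cheap way to confirm its contents before the originals are deleted.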

While you are preparing your data to be moved, it is also an opportunity to document, if you have not done so already, what you are archiving and how. Even a simple readme file added to your main directory can help others and your future self. If you are archiving data underlying a publication or published dataset, it is important that a summary of what is stored in /massdata, and how, is part of the dataset management plan and/or data availability statement.
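
A simple readme of the kind described above could be generated as follows. This is one possible layout, not a required format; the function name `write_readme` and the file name `README.txt` are arbitrary choices.

```shell
# Write a short README.txt into a directory, recording a description,
# the creation date and a sorted list of the files it contains.
write_readme() {
    dir=$1
    desc=$2
    {
        echo "Description: $desc"
        echo "Created: $(date +%Y-%m-%d)"
        echo "Files:"
        # List every file except the readme itself, relative to the directory
        (cd "$dir" && find . -type f ! -name README.txt | sort)
    } > "$dir/README.txt"
}

# Example (hypothetical path and text):
#   write_readme /path/to/run1 "Model output underlying paper X"
```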

Useful tools:

         TAR - to create archives

         Compressing tools

Monitoring mdss usage

In the past you could use nci_account to monitor the allocation and how much of it was still available. This is no longer possible, so if you want to move a large amount of data to mdss, you should first check with the lead CI of the project you want to use, to make sure enough storage is available.

Unfortunately, there is also no command to quickly check usage by user-id, as there is for /g/data and /short. Currently the only way to get this information is to ask help@nci.org.au; administrators can access this information for any CI of the group.

Transferring data to and from MDSS

NCI supports different commands to work with MDSS, as explained in their User Guide. The CMS team has also developed a utility called mdssdiff. This utility allows users to compare the contents of a local directory and a directory under /massdata. It can also recursively update the content of the massdata directory to match the local directory, or vice versa.
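
The idea behind such a comparison can be illustrated with standard tools. The sketch below is not mdssdiff and does not use its options; it is a simplified stand-in that compares file names only. It assumes you have saved a remote listing to a text file with one name per line (for example from `mdss ls`), and the function name `missing_on_remote` is hypothetical.

```shell
# Print names that exist in a local directory but are absent from a
# saved remote listing (one name per line). Compares names only, not sizes.
missing_on_remote() {
    local_dir=$1
    remote_listing=$2
    tmp_local=$(mktemp)
    tmp_remote=$(mktemp)
    ls -1 "$local_dir" | sort > "$tmp_local"
    sort "$remote_listing" > "$tmp_remote"
    # comm -23 keeps lines that appear only in the first (local) file
    comm -23 "$tmp_local" "$tmp_remote"
    rm -f "$tmp_local" "$tmp_remote"
}

# Example (hypothetical names):
#   missing_on_remote /path/to/archives remote_listing.txt
```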

Modifications to MDSS datasets

Contact NCI at help@nci.org.au if large metadata operations are needed on massdata, such as changing the ownership, project code or permissions of existing datasets.