Backup checklist


Revision as of 19:53, 13 July 2021


How essential is the data - does it have to be backed up?

  • Intermediate output, run logs
    • Not useful once an experiment is finished
  • Processing scripts
    • Useful to reproduce an experiment but could be redone
  • Model input
    • Essential to reproduce an experiment
  • Ease of reproducing the data
    • Observations cannot be recreated
    • Some input data might be available elsewhere but difficult to obtain
    • Can be reproduced, but the process is slow and CPU-intensive
    • Easy to reproduce: back up the code/workflow
  • Published results
    • These should be backed up; check what the repository's backup strategy is
    • Data underlying publications or a PhD thesis must remain available for 5 years from publication in case of a legal dispute
  • Number of people accessing the data
    • Just you, your group, people from around the world
    • This is particularly important if you maintain a web service in the cloud: make sure you can recover your instance quickly and resume the service with minimal interruption

How big is the data?

  • Text files - Source code, scripts, configuration files, small data files
    • Sizes < 100 MB
    • Not suitable for archives unless bundled into a tar file
  • Data files - NetCDF etc.
    • 1 GB to hundreds of GB
    • Archive systems like tape are specifically designed for this
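
The point above about bundling small files into a tar file before archiving can be sketched in Python with the standard tarfile module. This is only an illustration: the function name bundle_directory and the directory and archive names are assumptions, not part of any NCI tool.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Bundle everything under source_dir into one gzip-compressed tar file.

    Hypothetical helper for illustration; returns the names stored in the
    archive so the caller can check what was written.
    """
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(str(source), arcname=source.name)
    # Re-open the finished archive and list its members
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()
```

A call such as bundle_directory("scripts", "scripts.tar.gz") would produce a single file that can then be copied to tape or another archive system.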

How often is it updated?

  • Unchanging once produced - e.g. raw model output
    • Tape archives
  • Updated and reviewed occasionally - e.g. post-processed output
    • External drives, cloud and IT services: faster to retrieve than tape
  • Continually changing - source code under development
    • Use an automated revision control system - e.g. Subversion, Git


How safe is the backup system? What happens if...

  • You delete a file
  • The local file system fails
    • Back up on a different file system
  • Your storage provider goes bankrupt
    • Multiple backup providers
  • A fire destroys a building
    • Offsite/multiple backups
  • A flood/earthquake damages a city
    • Offsite/multiple backups with a wide separation

Regularly check the backup logs and verify that you can actually recover the files!
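
One way to test that files can be recovered is to restore a backup into a scratch directory and compare checksums against the originals. The sketch below assumes both directory trees are reachable from the same machine; the function names sha256_of and find_mismatches are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(original_dir, restored_dir):
    """Return relative paths of files that are missing or differ in the restored copy."""
    original = Path(original_dir)
    restored = Path(restored_dir)
    problems = []
    for src in sorted(original.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(original)
        copy = restored / relative
        # A file is a problem if it was not restored or its contents changed
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every original file was restored with identical contents. Files that exist only in the restored copy are not reported by this sketch.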


Options

  • /home on NCI or other institution server
    • Limited space
    • Backed up - Institutions will have their own strategies
  • Universities usually offer a backup option on shared drives for data on your laptop
    • Reasonably large, but not suitable for huge amounts of data
    • quick to retrieve
    • easy to automate
  • Tape archives/ Data repositories
    • Designed for archiving large and/or important data sets
    • Will have their own backup strategies - e.g. NCI tape is duplicated to two separate buildings at ANU
  • USB external hard drive
    • Simple way to have a separate backup, cheap
    • Can manage history manually or with a tool like Time Machine
    • Not suitable for long-term storage - disks have limited lifetime
  • Cloud storage, e.g. Dropbox, Google Drive, GitHub
    • Can access from anywhere
    • Offsite backup
    • Can set up folders to automatically be backed up
    • Is the service still going to be there in 5 years?
    • Git, GitHub and other version control services are designed for text files and handle large binary files poorly, so they are usually fine for code or README files but not for the actual data
    • NB some institutions might limit which services you can use for security reasons
  • Snapshots of instances for VMs on the Nectar and NCI clouds
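
Several of the options above, such as a shared drive or an external disk, are described as easy to automate. A minimal sketch of what that could look like is a script that writes a time-stamped copy of a folder to the backup location. The function name make_dated_copy and the naming scheme are assumptions for illustration, and scheduling (e.g. with cron) is left out.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_dated_copy(source_dir, backup_root, now=None):
    """Copy source_dir to backup_root/<name>-<timestamp> and return the new path.

    Hypothetical helper; 'now' can be passed in to make the name predictable.
    """
    source = Path(source_dir)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree raises an error if target exists, so earlier copies are never overwritten
    shutil.copytree(source, target)
    return target
```

Each run keeps a full copy, which gives a simple manual history but uses space quickly, so old copies need to be pruned by hand or by a separate step.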