Backup checklist

 
Latest revision as of 01:58, 30 July 2021

How essential is the data - does it have to be backed up?

  • Intermediate output, run logs
    • Not useful once an experiment is finished
  • Processing scripts
    • Useful to reproduce an experiment but could be redone
  • Model input
    • Essential to reproduce an experiment
  • Ease of reproducing the data
    • observations cannot be recreated
    • some input data might be available elsewhere but difficult to obtain
    • can be reproduced, but the process is slow and CPU-intensive
    • easy to reproduce: backup code/workflow
  • Published results
    • should be backed up; check the repository's backup strategy
    • data underlying publications or a PhD thesis must remain available for 5 years from publication in case of a legal dispute
  • Database
    • usually hard to rebuild but small enough to be backed up frequently
  • Number of people accessing the data
    • Just you, your group, or people from around the world
    • This is particularly important if you are maintaining a web service in the cloud: make sure you can recover your instance quickly and resume the service with minimal interruption


How big is the data?

  • Text files - Source code, scripts, configuration files, small data files
    • Sizes < 100 MB
    • Not suitable for archives unless bundled into a tar file
  • Data files - NetCDF etc.
    • 1 GB to hundreds of GB
    • Archive systems like tape are specifically designed for this
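
Bundling many small files into a single tar file before archiving can be sketched as below. This is a minimal illustration using Python's standard tarfile module; the function names bundle_directory and list_members are hypothetical helpers introduced here, and gzip compression is an assumption rather than a requirement of any particular archive system.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Bundle everything under source_dir into one gzip-compressed tar file."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores paths relative to the directory name, not absolute paths
        tar.add(str(source), arcname=source.name)
    return Path(archive_path)


def list_members(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Listing the members after writing the archive is a cheap sanity check that the bundle contains what you expect before the originals are removed.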

How often is it updated?

  • Unchanging once produced - e.g. raw model output
    • Tape archives
  • Updated and reviewed occasionally - e.g. post-processed output
    • External drives, cloud and IT services - faster to retrieve than tape
  • Continually changing - source code under development
    • Use a revision control system - Subversion, Git

 

How safe is the backup system? What happens if...

  • You delete a file
  • The local file system fails
    • Back up on a different file system
  • Your storage provider goes bankrupt
    • Multiple backup providers
  • A fire destroys a building
    • Offsite/multiple backups
  • A flood/earthquake damages a city
    • Offsite/multiple backups with a wide separation

Regularly check the backup logs and test that you can actually recover the files!
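
One way to test a recovery is to restore the backup into a scratch directory and compare checksums against the originals. The sketch below assumes both copies are reachable as local directories; sha256_of and verify_restore are hypothetical helper names, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    original = Path(original_dir)
    restored = Path(restored_dir)
    problems = []
    for source_file in sorted(original.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(original)
        candidate = restored / relative
        # A file is a problem if it was not restored or its content changed
        if not candidate.is_file() or sha256_of(candidate) != sha256_of(source_file):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every original file was found in the restored copy with identical content; anything else names the files to investigate.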

 

Options

  • /home on NCI or other institution server
    • Limited space
    • Backed up - Institutions will have their own strategies
  • Universities usually offer a backup option on shared drives for data on your laptop
    • reasonably large, but not suitable for huge amounts of data
    • quick to retrieve
    • easy to automate
  • Tape archives/ Data repositories
    • Designed for archiving large and/or important data sets
    • Will have their own backup strategies - e.g. NCI tape is duplicated to two separate buildings at ANU
  • USB external hard drive
    • Simple way to have a separate backup, cheap
    • Can manage history manually or with a tool like Time Machine
    • Not suitable for long-term storage - disks have limited lifetime
  • Cloud storage, e.g. Dropbox, Google Drive, GitHub
    • Can access from anywhere
    • Offsite backup
    • Can set up folders to automatically be backed up
    • Is the service still going to be there in 5 years?
    • Git, GitHub and other version control services are designed for text files rather than large binary files, so they are usually fine for code or readme files but not for the actual data
    • NB some institutions might limit what services you can use based on security issues
  • Snapshots of instances for VMs on Nectar and NCI clouds