Backup strategy

Revision as of 02:07, 13 July 2021

 

New page under construction

 

State how often the data will be backed up and to which locations. How many copies are being made? Storing data on laptops, computer hard drives or external storage devices alone is very risky. The use of robust, managed storage provided by university IT teams is preferable. Similarly, it is normally better to use automatic backup services provided by IT Services than rely on manual processes.

Have a recovery strategy worked out

 

Examples

A few years ago several hard drives in NCI's data centre failed, meaning the supercomputer and main disk storage had to be brought offline while repairs were made. Thankfully no data was lost, due to the use of redundant components; however, one more failed drive would have meant the loss of the /short filesystem. Model output generally goes to this filesystem, and because of its size NCI doesn't back it up. This means that if you aren't performing backups yourself there is a possibility of losing data, so it's important to make plans to avoid this.

Intro

A large amount of a researcher's time is spent in the production and manipulation of data. No technology is perfect, and sometimes disasters happen: a fire destroying your computer, a hard drive failing, or just deleting the wrong directory. No one wants to lose their data; results can be costly or even impossible to reproduce. It's important to have some idea of how you could recover if something were to happen to your files.

Strategy

What is essential?

The first thing to think about for backups is which part of your data is essential to keep, especially when you have a lot of it. You may want to preserve the output files from a 10,000-hour simulation, but not care about the log files once you've verified your results. Or perhaps you have observations of an unusual weather system which couldn't easily be reproduced, or scripts that create all the plots used in your latest paper. Also think about who else might be making use of your files: are other people using your results as input to their own simulations?

What are your vulnerable points?

Alternatively, are you making use of files in other people's directories? Do not depend on other people keeping files in the same place.

How big is your data?

The next thing to consider is how big your data is. Your home directory can easily be backed up to a portable hard drive or to Dropbox; this is not really an option when you have terabytes of model output, however. NCI provides a tape archive called MDSS for large files, and most institutions also have their own archives for important data created by their researchers.
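As a sketch of how archiving large data can look in practice: tape systems handle one large file far better than thousands of small ones, so bundle a run directory into a tar file first. The `mdss put` call and the project code `xy12` below are illustrative assumptions; check NCI's MDSS documentation for the exact interface available to your project.

```shell
set -e

# Demo data in a scratch directory (stand-in for real model output).
OUTDIR=$(mktemp -d)
mkdir -p "$OUTDIR/run01"
echo "netcdf placeholder" > "$OUTDIR/run01/field.nc"

# Bundle the whole run directory into one compressed tar file.
tar -czf "$OUTDIR/run01.tar.gz" -C "$OUTDIR" run01

# On an NCI login node you would then push the bundle to tape.
# "xy12" is a made-up project code; the call is guarded so the sketch
# still runs on machines where mdss is not installed.
if command -v mdss >/dev/null 2>&1; then
    mdss put "$OUTDIR/run01.tar.gz" xy12/backups/
fi
```

Keeping a listing of each tar file's contents (`tar -tzf`) next to the archive makes it much easier to find a single file years later without pulling the whole bundle back from tape.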

How often do you update your data?

You should also think about how often the data changes. Model output is unlikely to change, but programs are often improved over time. If you wanted to reproduce old results using a script you wrote some time ago, you'd need to recover the state of the file at the time you first ran it. Revision control software like Subversion and Git is designed for this use case, and a variety of hosting services on the web help you manage software development.
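A minimal sketch of that workflow with git (the file name, commit messages, and tag are invented for illustration): commit the script, tag the exact state used for a result, keep developing, and recover the tagged state whenever you need to reproduce the old run.

```shell
set -e

# Scratch repository standing in for a directory of analysis scripts.
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email "you@example.com"  # identity needed to commit
git config user.name "Example User"

# Commit the script and tag the exact state used for a result.
echo 'print("plot v1")' > plot.py
git add plot.py
git commit -q -m "Plot script as used in paper figures"
git tag paper-figs

# Development continues and the file changes...
echo 'print("plot v2")' > plot.py
git commit -qam "Rework plotting"

# ...but the old state can always be recovered from the tag.
git checkout -q paper-figs -- plot.py
```

Pushing the repository to a hosting service also gives you an offsite copy of the full history for free.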

Characteristics of different backup options

Not all backup solutions provide the same protection. Creating a copy of a file in a separate directory on your computer provides a measure of protection against accidental deletion, but it could still be lost if your hard drive fails. At the other end of the scale, you could store copies of your data on archival tapes in different cities, so that it would still be recoverable after a natural disaster. The latter is generally costly for a large data set; the amount of protection should be balanced against the value of your data.

Knowing all of this, what are the best options to use?

The first thing to do is to protect your workstation. Check what is already being backed up by your local IT support; they may have an existing backup strategy. For instance, NCI backs up user home directories on Vayu. Get an external hard drive at least as large as the one in your computer, then either set up a cron job to automatically rsync your internal and external hard drives on Linux, or enable Time Machine if you use a Mac. This gives you a backup of your whole workstation that you can use to recover all your installed programs and local data if a hard drive fails. You can also use a service like Dropbox to automatically back up your home directory; these services limit the space you can use, but they provide a remote backup and allow you to access files from anywhere.

Ask the CMS team or your institution's data services about archiving services available for storing large data sets. These services have their own data management plans in place to ensure the integrity of data; you may need to provide a data management plan describing the value of the data and how long it needs to be stored for.


Checklist

  • How essential is the data - does it have to be backed up?
    • Intermediate output, run logs
      • Not useful once an experiment is finished
    • Processing scripts
      • Useful to reproduce an experiment but could be redone
    • Model input
      • Essential to reproduce an experiment
    • Ease of reproducing the data
      • Observations cannot be recreated
      • Some input data might be available elsewhere but difficult to obtain
      • Can be reproduced, but a slow and CPU-consuming process
      • Easy to reproduce: back up the code/workflow
    • Published results
      • They should be backed up - check the repository's backup strategy
      • Data underlying publications or a PhD thesis has to be available for 5 years from publication in case of legal dispute
    • Number of people accessing the data
      • Just you, your group, people from around the world
  • How big is the data?
    • Text files - Source code, scripts, configuration files, small data files
      • Sizes < 100 MB
      • Not suitable for archives unless bundled into a tar file
    • Data files - NetCDF etc.
      • 1 GB to 100's of GB
      • Archive systems like tape are specifically designed for this
  • How often is it updated?
    • Unchanging once produced - e.g. raw model output
      • Tape archives
    • Updated and reviewed occasionally, but not frequently - e.g. post-processed output
      • External drives, cloud and IT services; faster to retrieve than tape
    • Continually changing - source code under development
      • Use automated revision control system - subversion, git
  • How safe is the backup system?
    • What happens if...
      • You delete a file
        • Have backups
      • The local file system fails
        • Back up on a different file system
      • Your storage provider goes bankrupt
        • Multiple backup providers
      • A fire destroys a building
        • Offsite/multiple backups
      • A flood/earthquake damages a city
        • Offsite/multiple backups with a wide separation
    • Check regularly that you can actually recover the files
  • Options
    • /home at NCI/University workstation
      • Limited space
      • Backed up - Institutions will have their own strategies
    • Tape archives / data repositories
      • Designed for archiving large and/or important data sets
      • Will have their own backup strategies - e.g. NCI tape is duplicated to two separate buildings at ANU
    • USB hard drive
      • Simple way to have a separate backup, cheap
      • Can manage history manually or with something like time machine
      • Not suitable for long-term storage - disks have limited lifetimes
    • Cloud storage, e.g. Dropbox, Github
      • Can access from anywhere
      • Offsite backup
      • Can set up folders to automatically be backed up
      • Is the service still going to be there in 5 years?
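The "check that you can actually recover" point above can be partly automated: record checksums when a backup is written, then re-read and verify them on a schedule. A small sketch (paths are scratch directories for illustration):

```shell
set -e

# Stand-in for a backup directory on an external drive or file server.
BACKUP=$(mktemp -d)
echo "important results" > "$BACKUP/results.nc"

# Record a checksum for every backed-up file at write time.
( cd "$BACKUP" && md5sum results.nc > MD5SUMS )

# Later (ideally from cron): re-read the files and compare against the
# recorded checksums.  A non-zero exit means the backup has silently
# become unreadable or corrupted - time to make a fresh copy.
( cd "$BACKUP" && md5sum -c --quiet MD5SUMS )
```

This catches silent corruption, but the only full test of a backup is an occasional trial restore of a few files to a scratch location.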