A large amount of a researcher's time is spent producing and manipulating data. No technology is perfect, and data loss can happen: losing your computer, a hard drive failing, or just deleting the wrong directory. No-one wants to lose data; results can be costly or even impossible to reproduce, so it is important to have some idea of how you would recover your files if something were to happen to them. Consider also situations in which the data is not irretrievably lost but might be unavailable for a while; this could be an issue if you have a deadline to meet.

== Have a recovery strategy worked out ==

You do not have to, but it helps to write down your backup strategy, both to make sure you have included everything and to remember what you did for particular files. State how often the data will be backed up and to which locations; this will probably differ depending on the kind of data. How many copies are being made?

Add details on how to recover the files, as this is easy to forget, and note where any logs created by the backup job are located.
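As an illustration only, a written plan can be as simple as a small table; every location, schedule, and recovery note below is made up:

<pre>
Data                    Where backed up                    How often            Copies  How to recover
raw model output        MDSS tape                          once, when produced  2       mdss get, see logbook
post-processed output   university share + external drive  weekly cron job      3       rsync back from share
scripts and documents   GitHub + cloud drive               on every change      3       git clone / web client
</pre>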

=== What is essential? ===

The first thing to consider is which part of your data is essential to keep, especially when you have a lot of it. You may want to preserve the output files from a very long simulation, but not care about the log files once you have verified your results. Perhaps you have collected observations that cannot be reproduced, particularly if they record a rare phenomenon. The scripts that create all the plots in your latest paper could probably be rewritten, but only with a lot of time and effort.

Also think about who else might be making use of your files: are other people using your results in their own research projects?

=== What data is most vulnerable? ===

Not all the storage we use is identical: a hard drive failure is more likely on your laptop than on the supercomputer or university drives, which typically use redundant disk configurations that partially protect against data loss. Some services might be more reliable than others, including backup services. Make sure you know the characteristics of the storage you are using or intend to use, and check whether it is already backed up. Consider also that technology changes and hard drives deteriorate with time, so the age of the storage is another factor to take into account.

If you are using files managed by others, make sure you know how they are managed: whether they are likely to be removed, and whether they are backed up.

=== What are you protecting against? ===

Apart from data loss, you might also be concerned about your data being temporarily unavailable. Similarly, you might want to retrieve older versions of files; this is more likely for documents and code. If you have completed a project and need to preserve the data, what you are really thinking of is archiving, which is a different process, and you should look at archiving options instead.

=== How big is your data? ===

The next thing to consider is how big your data is. Your home directory can easily be backed up to a portable hard drive or to Dropbox, but this is not really an option when you have terabytes of model output. However, supercomputer providers offer a tape data archive, known as MDSS at NCI, for large files. Most institutions also have their own archives for important data created by their researchers.
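Knowing the actual sizes makes this decision easier. On a Linux system you could check with something like the following; the paths are only examples:

<pre>
# total size of each candidate directory
du -sh ~/analysis ~/model_output
# free space on the device you intend to back up to
df -h /media/backup_drive
</pre>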

=== How often do you update your data? ===

You should also think about how often the data changes; this determines how often you will need to back it up. Code changes often, and using a version control system is the only way to really keep track of all the versions. Similarly, you might want your documents backed up automatically on a daily basis. Model output or recorded observations are unlikely to change, so you might need to back them up only once. Post-processed output will change more frequently while you are still conducting your analysis; it might be worth having some level of automation and backing up those files every few days.

=== Characteristics of different backup options ===

Not all backup solutions provide the same protection. Creating a copy of a file in a separate directory on your computer provides some protection against accidental deletion, but it would still be lost if your hard drive failed. At the other end of the scale, you could store copies of your data on archival tapes in different cities, so that it would still be recoverable after a natural disaster. The latter is generally costly for a large data set; the amount of protection should be balanced against the value of your data.

=== Set up regular checks ===

Automated backup jobs can also fail, so regularly make sure that you can access your backed-up files and that they are being updated as expected. A server restart could stop your cron job from running; a change in permissions might make the backup fail. If the backup produces logs, check them for errors and fix any problems as soon as possible.
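A minimal sketch of such a check on a GNU/Linux system; the backup destination and log path are assumptions:

<pre>
# files modified in the backup within the last week - the list should not be empty
find /media/backup_drive/home -type f -newermt "$(date -d '7 days ago' +%F)" | head
# scan the backup log for problems
grep -iE "error|fail" ~/logs/backup.log | tail
</pre>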

== What backup options are available? ==

'''University IT services'''

Most universities provide shared storage on their local network that can be used to set up automatic backups. These services offer robust, properly managed storage, so they are an excellent choice. You can easily select files from your laptop to back up, and it is usually fast to retrieve them should you need to; this is also useful when you are switching to a new computer. Check what is already being backed up by your local IT support, as they may back up some parts of their servers by default. For instance, NCI backs up user home directories on its servers.

'''Tape'''

For anything really big, which is quite common in climate science, you can use tape. It is less accessible than other options, but it has a large capacity and is optimised for storing large files. Tape is also a good choice for archiving; however, you should make sure this is its intended use. You can learn more in our [[Archiving_data|archiving data]] page.
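As a sketch of how this works with NCI's mdss utility (the project code xy12 and the file names are made up):

<pre>
# tape works best with a small number of large files, so bundle first
tar -cf experiment1.tar experiment1/
mdss -P xy12 put experiment1.tar output/   # copy the bundle to tape
mdss -P xy12 ls output/                    # confirm it arrived
mdss -P xy12 get output/experiment1.tar    # retrieve it later
</pre>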

'''Institutional or discipline data repositories'''

These are not, strictly speaking, backup services, but if you have a finished data product they might be suitable for preservation. Ask the CMS team or your institution's data services about archiving services available to store large data sets. These services have their own data management plans in place to ensure the integrity of data; you may need to provide a data management plan describing the value of the data and how long it needs to be stored for. Similarly, publishing your data and making it available is a good way both to preserve it and to share it.

'''External hard drive'''

Get an external hard drive at least as big as the one in your computer, then either set up a cron job to automatically rsync your internal and external hard drives on Linux, or enable Time Machine if you use a Mac. This gives you a backup of your whole workstation that you can use to recover all your installed programs and local data if a hard drive fails.
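A minimal sketch of the Linux route, assuming the drive is mounted at /media/backup_drive:

<pre>
# one-off copy of your home directory onto the external drive
rsync -a ~/ /media/backup_drive/home/
# to automate it, add a line like this with `crontab -e` (runs daily at 19:00)
0 19 * * * rsync -a $HOME/ /media/backup_drive/home/
</pre>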

'''Cloud'''

You can also use a service like Dropbox or Google Drive to automatically back up your home directory. These services limit the space you can use, but they provide a remote backup and allow you to access your files from anywhere.
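If you prefer the command line, a third-party tool such as rclone (our suggestion here, not part of those services) can sync a directory to most cloud providers; the remote name "gdrive" is created during the one-time setup:

<pre>
rclone config                                                # interactive remote setup
rclone sync ~/documents gdrive:backup/documents --progress   # mirror a directory to the cloud
</pre>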

'''Laptop'''

Your own laptop can be used to back up files you normally keep in the cloud or on a different server, but do this only for relatively small amounts of data. If you want to keep extra copies of code, using a shared server is also fine. Do not use disk storage at NCI or another server, or Nectar cloud resources, to back up large amounts of data! Storage on these servers is allocated to projects and shared, and backup is not a good use of this resource.

'''Snapshots for VM instances'''

If you are managing cloud servers via the NCI cloud or Nectar, take regular snapshots of your instances. It is not unlikely that you will run into issues when updating a VM, and it is really useful to be able to restart your instance from a working state.
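These services are based on OpenStack, so with the OpenStack command-line client configured a snapshot could be taken roughly like this (the instance and snapshot names are invented):

<pre>
openstack server image create --name myvm-before-upgrade myvm
openstack image list --name myvm-before-upgrade    # check the snapshot completed
</pre>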

'''Version control'''

Version control is a must for your code and for any text files, as they change frequently and you might want to recover older versions. It is not a good option for data files, as they are usually binary; however, you can use it to keep track of changes to your data by versioning readme files that describe your workflow. You can learn more on our [[Git_Introduction|git and GitHub]] page.
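A minimal sketch with git; the file and repository names are only examples:

<pre>
cd ~/analysis
git init                               # start tracking this directory
git add plot_results.py README.md
git commit -m "First version of plotting scripts"
# a remote on a hosting service doubles as an off-machine backup
git remote add origin git@github.com:username/analysis.git
git push -u origin main
</pre>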

We provide a [[Backup_checklist|backup checklist]] on another wiki page for your convenience. [https://support.ehelp.edu.au/support/solutions/articles/6000085112-backing-up-data Nectar] also has an informative page on backup strategies.

[[Category:Data induction]]