A large amount of a researcher's time is spent producing and manipulating data. No technology is perfect, and disasters can happen, from a fire destroying your computer to a hard drive failing or simply deleting the wrong directory. No-one wants to lose all of their data; results can be costly or even impossible to reproduce. It's important to have some idea of how you could recover if something were to happen to your files. Consider also situations in which the data is not irretrievably lost but might not be available for a while; this could be an issue if you have a deadline to meet.
Have a recovery strategy worked out
State how often the data will be backed up and to which locations. How many copies are being made? Storing data on laptops, computer hard drives or external storage devices alone is very risky. The use of robust, managed storage provided by university IT teams is preferable. Similarly, it is normally better to use automatic backup services provided by IT Services than rely on manual processes.
What is essential?
The first thing to think about for backups is which part of your data is essential to have, especially when you have a lot of it. You may want to preserve the output files from a very long simulation, but not care about the log files once you have verified your results. Or perhaps you have collected observations that cannot be reproduced, particularly if they are observations of a rare phenomenon. The scripts used to create all the plots in your latest paper could probably be rewritten, but only with a lot of time and effort.
Also think about who else might be making use of your files: are other people using your results in their own research projects?
What are your vulnerable points?
Is your laptop old, or has it been showing signs of potential issues?
Did you back up results to an external hard drive years ago? Remember that technology can become obsolete, and disks deteriorate with time.
If you are using files managed by others, make sure you know how they are managed: whether they are likely to be removed, and whether they are backed up.
How big is your data?
The next thing to consider is how big your data is. Your home directory can easily be backed up to a portable hard drive or to Dropbox, but this is not really an option when you have terabytes of model output. However, supercomputer providers offer tape data archives, known as MDSS at NCI, to archive large files. Most institutions also have their own archives for important data created by their researchers.
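As a rough sketch, archiving a finished experiment to a tape store could look like the following. The run directory, file names, and the NCI project code are hypothetical, and the `mdss` commands are NCI-specific (left commented, since they only work on NCI systems):

```shell
# Stand-in for a finished run directory with model output (hypothetical names).
mkdir -p run42/output
echo "sample output" > run42/output/temp.nc

# Bundle the run into a single tar file before archiving: tape systems handle
# a few large files far better than many small ones.
tar -czf run42_output.tar.gz run42/output/

# NCI-specific, assuming project code w35 (hypothetical):
# mdss -P w35 put run42_output.tar.gz experiments/
# mdss -P w35 ls experiments/   # verify the file arrived before deleting the local copy
```

Listing the archive afterwards, before removing the local copy, is a cheap sanity check that the transfer actually succeeded.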
How often do you update your data?
You should also think about how often the data changes. Code changes often, and using a version control system is the only way to really keep track of all the versions. Model output or recorded observations are unlikely to change, so you might need to back them up only once. Post-processed output will change more frequently while you are still conducting your analysis, so it might be worth having some level of automation and backing up your files every few days. University IT services might be your best option, depending on the file sizes. Version control is not a good option for data files, as they are usually binary; however, you can use it to keep track of readme files describing your workflow, and back those up instead.
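For example, a lightweight way to keep a workflow description under version control might look like this (all paths and file contents here are placeholders):

```shell
# Keep a small, text-only record of the workflow under version control;
# unlike the binary data itself, it is cheap to version.
mkdir -p analysis
git init -q analysis

# README.md records where the raw data lives, how it was processed,
# and where the backups are (hypothetical paths and dates).
cat > analysis/README.md <<'EOF'
Raw data:    /g/data/w35/run42/output (archived to tape 2024-05-01)
Processing:  scripts/postprocess.sh, backed up nightly to external drive
EOF

git -C analysis add README.md
git -C analysis -c user.name="A Researcher" -c user.email="a@example.com" \
    commit -q -m "Document data locations and backup schedule"
```

The readme then travels with your code repository, so anyone recovering the project later knows where each copy of the data lives.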
Characteristics of different backup options
Not all backup solutions provide the same protection. Creating a copy of a file in a separate directory on your computer provides a measure of protection against accidental deletion; however, both copies would be lost if your hard drive fails. At the other end of the scale, you could store copies of your data on archival tapes in different cities, so that it would still be recoverable in the case of a natural disaster. The latter is generally costly for a large data set, so the amount of protection should be balanced against the value of your data.
Knowing all of this, what are the best options to use?
The first thing to do is to protect your laptop/workstation. Check what is already being backed up by your local IT support, as they may have an existing backup strategy; for instance, NCI backs up user home directories on its servers. Get an external hard drive at least as big as your computer's internal drive, then either set up a cron job to automatically rsync your internal and external hard drives on Linux, or enable Time Machine if you use a Mac. This gets you a backup of your whole workstation that you can use to recover all your installed programs and local data if a hard drive fails. You can also use a service like Dropbox to automatically back up your home directory; these services limit the space you can use, but they provide a remote backup and allow you to access files from anywhere.
If you are managing cloud servers via the NCI cloud or Nectar, take regular snapshots of your instances. It is not uncommon to run into issues when updating a VM, and it is really useful to be able to restart your instance from a known working state.
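With the standard OpenStack command-line client, taking a snapshot is a single command. The instance name `myvm` is hypothetical, and the actual snapshot command is left commented since it requires cloud credentials (assumed to be loaded with something like `source my-project-openrc.sh`):

```shell
# Build a date-stamped snapshot name so old snapshots are easy to identify.
SNAP_NAME="myvm-snapshot-$(date +%Y-%m-%d)"
echo "Snapshot name: ${SNAP_NAME}"

# Requires python-openstackclient and sourced credentials; "myvm" is a
# placeholder for your instance name. Uncomment to actually run:
# openstack server image create --name "${SNAP_NAME}" myvm
# openstack image list   # confirm the snapshot exists before relying on it
```

Dating the snapshot name makes it easy to prune old snapshots later, which matters because snapshots count against your project's storage quota.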
Ask the CMS team or your institution's data services about archiving services available to store large data sets. These services have their own data management plans in place to ensure the integrity of the data; you may need to provide a data management plan describing the value of the data and how long it needs to be stored for.