Leaving the Centre guidelines

Revision as of 18:37, 30 March 2020 by C.carouge

The goal of these guidelines is to make sure you do not leave a mess behind for someone else to sort out. They also help identify data and code that others in the Centre might find useful.

The main message from these guidelines is:

  • You should not own any data on disk when you leave or shortly after you leave.
  • Any data you own on tape should be accessible for read and write by other people.

The Exit Email

About 2 months before you leave, you should receive the following email:

Dear xxxx,

Given that your contract is due to end in 2 months' time, we ask that you start considering how data in your accounts will be managed in future. Specifically:
- please delete anything that is not required
- please send an email to the CMS team (copied on this email), detailing the nature of what will remain. This should outline major directories and explain:
	- what they contain 
	- why they’re important to keep
	- what is required to access them - which repository / scripts?
	- if this data is stored anywhere else
	- when can they be deleted (date)
	- where this data should be stored when your account is shut down (please consult with a CLEX CI)
	- who in CLEX should have access to this data?
	- please also let the CMS team know if compute access will be required after your contract ends
- please reply to both the admin and CMS teams with future contact details when you have them

The CMS team can help with tools and techniques to help you manage your data. Please understand that storage is extremely limited, so that a lack of response to the questions above will likely result in your data being archived as soon as your contract ends.

Regards,

The CLEX Admin Team

Sorting your data

Find out well before you leave what you will be required to do, so you can prepare for it. Sorting through your files might take longer than you think.

You might need to discuss the following questions with your supervisor:

  1. Will you require access to the files after you leave (e.g. for a paper review)?
  2. What files will be useful to others? Should they be published?
  3. What files won't be useful to others? Should they be archived or deleted?

If you need specific advice on how to transfer or archive your files, you can always ask our helpdesk for assistance: [1].

You require access to the files after you leave

Files at NCI

When you leave, you will keep your NCI credentials IF you keep your contact details up to date through https://my.nci.org.au

Make sure to quit all NCI projects you don't need access to anymore. You can be ruthless as it is quite easy to regain access to those projects later on if needed.

Some projects have strict license terms attached (e.g. access). Your membership might be revoked at any time after you leave the Centre if you haven't negotiated extended access to those projects. Please contact the Lead CI or the project manager to discuss your needs.

Files at your institution

Most universities will close your university and e-mail accounts. University data services are often accessible only via your university account. If that is the case, arrange access for yourself before you leave by contacting IT services or the CI of the projects you used to deposit data. Specifics on university data services, and advice on what happens when you leave, can be found by following the relevant link in the data services page.

Files that will be used by others

Publish the files

You should publish your files if they are frequently used by several other people. Ideally, data should be published as soon as its usefulness to others is clear, not just when you leave.

If your files are at NCI, that is all you need to do. If your files are held at your institution, you may need to make sure there is a copy that is accessible by others and owned by someone who is likely to stay for years to come. This might depend on your institution's data services.

Files not for publication

Some files might be useful to a small cluster of people but not be worth publishing. In that case, you simply need to change the ownership of the files: get the files to be owned by someone who stays after you.

If the files are at NCI, you cannot change the files' ownership yourself. You and the Lead CI of the project owning the files will need to contact help@nci.org.au so NCI staff can do it for you.
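Before asking for an ownership change, it helps to know exactly which files you own. The following is a minimal Python sketch, not an official CMS or NCI tool, that walks a directory tree and lists the entries owned by a given numeric user id. The function name `files_owned_by` is an assumption made for this illustration.

```python
import os
from pathlib import Path


def files_owned_by(root, uid):
    """Return sorted paths under root whose owner has the given numeric uid."""
    owned = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            try:
                # lstat so a symlink is judged by its own owner, not its target's
                if path.lstat().st_uid == uid:
                    owned.append(path)
            except OSError:
                # Skip entries that vanish or cannot be read during the scan
                continue
    return sorted(owned)
```

For example, calling `files_owned_by(some_project_directory, os.getuid())` on a Unix system returns everything you own below that directory. The resulting list can be attached to your request so that it is clear which files need a new owner.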

Files not useful anymore

It can be difficult to know what to keep and what to delete. A general rule might be to keep what is needed to reproduce your work and delete everything else. It gets complicated as this should be weighed against the cost, in money and time, of reproducing your work. This rule still allows us to clearly identify files that always need to be kept and files that never need to be kept.

To keep, always

Codes

There are 2 types of codes: codes distributed by others (e.g. climate models) and codes written by you.

For codes distributed by others, you simply need to keep a reference to the version used as long as no modification to the code was made by you. If you modified the code and this modification is part of a standard version of the model, you might be able to simply reference this version as long as it is exactly the version you used. If you modified the code for yourself only, this is now a code written by you and falls into the second category.

For codes written by you, the simplest approach is to save the code in a Git repository on GitHub, then publish this repository via Zenodo. We can help with the publication. It is absolutely fine to create one repository per project or paper containing all your codes. You can then use the README file from the GitHub repository to explain how to reproduce your results. Don't forget to clearly reference everything one might need in addition to this repository. Alternatively, you can have one repository per code, especially if you expect to reuse the same code for other work.

Configuration files for running the codes and some input files.

In addition to the codes themselves, you need to keep everything that enables someone to run the codes the same way you did. Usually, the most complicated configurations are for climate models. Some climate models will save your configurations in version control repositories (e.g. UM, ACCESS, ACCESS-OM2), in which case you simply need to keep the information on how to retrieve these configurations. Some models don't save your configurations and you need to do it yourself.

For the input files, some inputs are published data in which case you need to keep the reference to this data (including the version). If you have written several codes, the output of a piece of code will be the input of the next piece of code, in which case you do not necessarily need to keep that data. But you need to keep the information on your workflow.

A description of your workflow

This can be a tricky one as there is no one-size-fits-all format to save this information. It is fine to write a README file and archive it with other files from the project that need archiving. You can also have a special GitHub repository just for this README file, or a repository for all your projects with a README for each project. Whatever format you choose, it is important for this information to be publicly available (unless your project was restricted) and not a personal note.

This description should explain clearly, step by step, what someone should do to reproduce your work.

This information has to be kept for at least 5 years. The time requirements differ slightly depending on institutions and funding bodies.

To delete, always

You do not need to keep files that are not necessary to reproduce your work:

  • log files
  • failed experiments
  • temporary files, such as those created by successive cdo/nco commands.
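As a starting point for finding such files, the sketch below lists candidates for deletion together with their sizes. It only reports and never deletes anything. The filename patterns are hypothetical examples, not a Centre convention, so adjust them to match how your own log and temporary files are named.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical patterns: adjust to how your own logs and temporary files are named
CANDIDATE_PATTERNS = ("*.log", "*.tmp", "tmp_*")


def cleanup_candidates(root, patterns=CANDIDATE_PATTERNS):
    """Return (path, size_in_bytes) for files under root matching any pattern."""
    found = []
    for path in Path(root).rglob("*"):
        # Ignore directories and symbolic links; only regular files are reported
        if path.is_symlink() or not path.is_file():
            continue
        if any(fnmatch(path.name, pattern) for pattern in patterns):
            found.append((path, path.stat().st_size))
    return sorted(found)


def report(root):
    """Print the candidates and their total size; nothing is deleted."""
    candidates = cleanup_candidates(root)
    for path, size in candidates:
        print(f"{size:>12d}  {path}")
    total = sum(size for _, size in candidates)
    print(f"{len(candidates)} files, {total} bytes in total")
```

Reviewing the report before deleting anything by hand keeps the decision with you. A pattern match alone does not prove a file is disposable.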

To keep or to delete?

Climate model outputs. These are at least reproducible, although bitwise reproducibility might not be possible if underlying libraries change or the machine you used is retired. The problem is that this data is usually quite large, so storing it is costly, but reproducing it is also time-consuming. This is where you need to talk to your supervisor, as the answer depends on how likely it is that someone else will find this data useful.

Additionally, it is worth considering how many restart files you need to keep (if any). In most cases, you need to archive fewer restart files for the long term than you keep while actively working on a project.

Output of lengthy processing. It might feel necessary to keep those files, but we would argue it is usually not worth the storage cost unless there is a clear indication they will be used again soon. First, with modern programming techniques, a lot of lengthy processing can be shortened significantly. Second, the very real cost of the storage outweighs the very hypothetical time saved in the future.

File transfer

If you need to transfer files to a different machine (for example from NCI to your university or personal computer):

  • use sftp, scp or rsync to transfer files securely (rsync can be resumed)
  • if transferring from/to NCI:
      • use the dedicated data-mover nodes: g-dm.nci.org.au
      • use copyq if you want to queue a job
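The rsync option can be scripted so that every transfer uses the same settings. The sketch below is one possible wrapper, not a prescribed procedure. It builds an rsync command with `--partial`, which keeps partially transferred files so an interrupted transfer can pick up from them, and defaults to `--dry-run` so nothing is copied until you have checked the file list. The function names are assumptions made for this illustration, and rsync must be installed on the machine running the script.

```python
import subprocess


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync command line as a list of arguments."""
    command = ["rsync", "--archive", "--verbose", "--partial"]
    if dry_run:
        # --dry-run lists what would be transferred without copying anything
        command.append("--dry-run")
    command += [source, destination]
    return command


def transfer(source, destination, dry_run=True):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_rsync_command(source, destination, dry_run), check=True)
```

A call would look like `transfer("abc123@g-dm.nci.org.au:/path/at/nci/", "/path/at/destination/")`, where the username and both paths are placeholders to replace with your own. Once the dry run lists the files you expect, repeat the call with `dry_run=False` to perform the copy.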