Scratch file expiry

Revision as of 20:17, 2 May 2022 by C.carouge

On 17 May 2022, NCI is introducing an automatic system to purge unused data from the scratch filesystem. The steps below describe how best to prepare for that change. We also point to some useful tools for building a good data workflow going forward.

Preparation

Your best course of action is to start preparing now, so that no useful data goes into quarantine on 17 May. We recommend the following steps:

  • Read the information provided by NCI.
  • Clean up /g/data: delete what you can, and archive what you can to tape or outside NCI.
  • Clean up /scratch: delete what you can, and move data to /g/data, tape or outside NCI if you need long-term storage for data you will not be accessing.
  • Run “nci-file-expiry list-warnings -p <project> > expiry_warning_<project>.txt” for every project you are a member of.
  • Check the output of nci-file-expiry. If anything listed there is important and at risk of deletion, decide whether it should go on /g/data or tape instead.
  • Run “nci-file-expiry list-warnings”: this catches any file you own in a project you may have forgotten about. Decide what to do with anything that appears there.
  • Rethink your data pipeline: do not leave data you are no longer using in /scratch. Decide what to do with it when you stop using it: delete it, move it to /g/data or outside NCI, or archive it to tape.
  • Manage your data under /g/data regularly: review your data, and archive to tape or outside NCI as necessary.
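
The two nci-file-expiry steps above can be combined into a small script. This is only a sketch: it assumes that the Unix groups reported by `id -Gn` correspond to your NCI project codes (groups that are not projects may simply produce an error or an empty file), and the helper names `warning_file` and `check_projects` and the file name expiry_warning_all.txt are illustrative, not part of any NCI tool.

```shell
#!/usr/bin/env bash
# Sketch: save the nci-file-expiry warnings for each project in its own file.
# Assumption: your Unix groups (from `id -Gn`) match your NCI project codes.

# Build the report file name used in the steps above.
warning_file() {
    printf 'expiry_warning_%s.txt\n' "$1"
}

# Run the per-project check for every project code passed as an argument.
check_projects() {
    local project
    for project in "$@"; do
        nci-file-expiry list-warnings -p "$project" > "$(warning_file "$project")"
    done
}

# Only run when nci-file-expiry is available, e.g. on an NCI login node.
if command -v nci-file-expiry >/dev/null 2>&1; then
    # Unquoted on purpose: split the group list into one argument per group.
    check_projects $(id -Gn)
    # Catch-all for files you own in projects you may have forgotten about.
    # The output file name is hypothetical.
    nci-file-expiry list-warnings > expiry_warning_all.txt
fi
```

Run it from a directory where you are happy to keep the reports, then review each expiry_warning_<project>.txt file as described above.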

Additional information

Below is a list of resources you might find useful to prepare for the automatic file expiry and build a good data workflow for the future.

Archiving data at NCI

Blog on building a sustainable data workflow

Description of the various filesystems at NCI

Long term strategy

  1. Keep using /scratch. There would not be enough disk space elsewhere if nobody used /scratch. Your data is safe on /scratch as long as you access it, i.e. read, modify or create it. Only the data you won't use for some time needs to be managed.
  2. Managing your data does not mean storing everything on /g/data. You need to think about what future use you have for that data. If you don't need it anymore and it can be reproduced, delete it. If you will need to publish the data, publish it now: it is a lot easier to publish a new version after a small change than to publish the initial version. If you can't publish now but won't need the data for a long time, put it on tape. If you won't need the data for a short time but know you will get back to it soon, put it on /g/data.
  3. Learn how to use the tape system. See our blog.
  4. Make managing your data a periodic, frequent task in your calendar. This includes all data: /scratch, /g/data and tape.
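
As a rough illustration, the reasoning in item 2 can be written as a small shell function. This is one reading of the advice rather than an official rule: the function name `suggest_action`, its yes/no and soon/later arguments, and the order in which the questions are asked are all assumptions made for this sketch. It applies only to data you are about to stop using; data you are actively working on stays on /scratch.

```shell
# Sketch of the decision rule in item 2. All names are illustrative.
# usage: suggest_action <needed_again> <reproducible> <publishable_now> <next_use>
#   needed_again, reproducible, publishable_now: yes | no
#   next_use: soon | later
suggest_action() {
    needed_again=$1 reproducible=$2 publishable_now=$3 next_use=$4
    if [ "$needed_again" = no ] && [ "$reproducible" = yes ]; then
        echo "delete"        # not needed and can be regenerated
    elif [ "$publishable_now" = yes ]; then
        echo "publish"       # publish now, revise later if required
    elif [ "$next_use" = later ]; then
        echo "tape"          # long gap before the next use
    else
        echo "/g/data"       # short gap, you will come back to it soon
    fi
}

suggest_action yes no no later   # prints: tape
```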