Best practices for directories and files

Revision as of 03:28, 30 June 2021 by P.petrelli

The names you choose for files and directories, and more generally the way you organise your data, i.e. your directory structure (DRS), can help users navigate the data, provide extra information, and prevent confusion and users accessing the wrong data. As in many other cases, the best file organisation depends on your specific research project and on the server where the data is stored. Here we just list a few guidelines and tips to help you decide.

General considerations

  1. Familiarise yourself with the storage system. Make sure you are storing the files in the most appropriate place, and find out whether the storage is backed up, what your allocation is, and which rules or best practices apply.
  2. Take into account how you or others might want to use the data. This is particularly important when deciding on the DRS, but also when deciding how to divide data across files for big datasets such as model output. Doing so at the start of the project will spare you a lot of time you might otherwise spend re-processing all your files.
  3. Be consistent. This applies to both the organisation and the naming: consistency is essential for the data to be machine-readable, i.e. easy to access with code. Use community standards and/or controlled vocabularies wherever possible.
  4. Consider adding a readme file in the main directory (we always do that for data we publish), including an explanation of the DRS and of the naming conventions, abbreviations and/or codes you used. If you used standards and controlled vocabularies, all you have to do is include a link to them.
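
As an illustration of why consistency makes data machine-readable, here is a minimal sketch in Python. The naming convention `<variable>_<frequency>_<model>_<start>-<end>.nc` and all the file names are hypothetical, invented for this example; they are not taken from any particular standard.

```python
import re

# Hypothetical naming convention, assumed for this example only:
#   <variable>_<frequency>_<model>_<YYYYMMDD>-<YYYYMMDD>.nc
PATTERN = re.compile(
    r"(?P<variable>[A-Za-z0-9]+)_(?P<frequency>[a-z]+)_(?P<model>[A-Za-z0-9-]+)"
    r"_(?P<start>\d{8})-(?P<end>\d{8})\.nc"
)


def parse_name(filename):
    """Return the name components as a dict, or None if the name breaks the convention."""
    match = PATTERN.fullmatch(filename)
    return match.groupdict() if match else None


names = [
    "tas_mon_modelA_20000101-20001231.nc",
    "pr_day_modelA_20000101-20001231.nc",
    "Precip daily Jan 2000 (final).nc",  # inconsistent name, cannot be parsed
]

parsed = {name: parse_name(name) for name in names}

# Consistent names can be filtered by any component with one line of code.
monthly = [name for name, parts in parsed.items() if parts and parts["frequency"] == "mon"]

# Inconsistent names have to be handled by hand.
unparsed = [name for name, parts in parsed.items() if parts is None]
```

With a consistent convention, selecting all monthly files is a single filter; the third name would need special handling.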

Naming

You can use your filenames to carry information. Here are some elements you can consider including:

  • project, simulation and/or experiment acronyms: you might have to use a combination of them
  • spatial coverage: the region or coordinate range covered by the data; for climate model data this could also be a specific domain, e.g. ocean, land, etc.
  • grid: either a grid label or the spatial resolution
  • temporal coverage: a specific year/date or a temporal range
  • temporal frequency: monthly, daily, etc.
  • type of data: this depends on context. If the same directory contains data from different instruments, it is important to specify the instrument in the name; for coupled model output this could be the model component; if you are using one file per variable, the name should include the variable name
  • version: this is really important if you are sharing the data, even if only one version exists at the time
  • correct file extension
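
The elements above can be combined into a file name programmatically, which helps keep names consistent. The following is a minimal sketch in Python; the function `build_filename`, the order of the components and the example values ("myproj", "tas", "1deg") are assumptions made for illustration, not a prescribed convention.

```python
from datetime import date


def build_filename(project, variable, frequency, grid, start, end, version, extension="nc"):
    """Join the name components with underscores (hypothetical convention).

    Dates are written as YYYYMMDD and the version is prefixed with "v".
    """
    period = f"{start:%Y%m%d}-{end:%Y%m%d}"
    parts = [project, variable, frequency, grid, period, f"v{version}"]
    return "_".join(parts) + "." + extension


name = build_filename(
    project="myproj",      # hypothetical project acronym
    variable="tas",        # one file per variable
    frequency="mon",       # temporal frequency
    grid="1deg",           # spatial resolution used as grid label
    start=date(2000, 1, 1),
    end=date(2000, 12, 31),
    version=1,
)
print(name)  # myproj_tas_mon_1deg_20000101-20001231_v1.nc
```

Generating names from a single function, rather than typing them by hand, means every file follows the same pattern and the version is never forgotten.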

DRS

 


Tips for machine-readable files

  • avoid special characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “
  • do not use spaces to separate words: use underscores "_", dashes "-" or CamelCase
  • use YYYYMMDD for dates, as it will sort your files in chronological order; absolutely avoid "Jan, Feb, .." for months, as they are much harder to handle in code
  • for number sequences use leading zeros: 001, 002, .. 020, .. 103 rather than 1, 2, .. 20, .. 103
  • try to avoid overly long names: keep a single file or directory name under 255 characters and a full path under 30000 characters (the actual limits depend on the file system)
  • avoid having a large number of files in a single directory, but also an excessive number of directories with one file each
  • always include the file extension: some software can recognise files from their header, but this is not always the case
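
Some of these tips can be checked automatically. The sketch below, in Python, flags names with spaces or special characters, names that are too long and names without an extension, and shows the effect of leading zeros on sorting. The helper names `check_name` and `padded`, and the set of allowed characters, are assumptions made for this example.

```python
import re

# Characters allowed in this sketch: letters, digits, underscore, dash and dot.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def check_name(filename, max_length=255):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if not SAFE_NAME.fullmatch(filename):
        problems.append("contains spaces or special characters")
    if len(filename) >= max_length:
        problems.append("name too long")
    if "." not in filename.strip("."):
        problems.append("missing file extension")
    return problems


def padded(number, width=3):
    """Format a sequence number with leading zeros, e.g. 20 -> '020'."""
    return f"{number:0{width}d}"


numbers = [1, 2, 20, 103]
# Sorting as text, which is how file listings are usually ordered:
without_zeros = sorted(str(n) for n in numbers)
with_zeros = sorted(padded(n) for n in numbers)

print(without_zeros)  # ['1', '103', '2', '20']
print(with_zeros)     # ['001', '002', '020', '103']
print(check_name("run_001_20210630.nc"))  # []
print(check_name("my data (final).nc"))   # ['contains spaces or special characters']
```

Without leading zeros the text sort puts 103 before 2, while the zero-padded names sort in numeric order.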

Resources

  • https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/get-started-open-reproducible-science/best-practices-for-organizing-open-reproducible-science/
  • https://libguides.princeton.edu/c.php?g=102546&p=930626 (includes a video)
  • https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming (the same content can be downloaded as a PDF)