Best practices for directories and files


Latest revision as of 21:58, 12 September 2021

The names you choose for files and directories, and more generally the way you organise your data, i.e. your directory structure (DRS), can make the data easier to navigate, provide extra information, and help users avoid confusion and accessing the wrong data. As in many other cases, the best file organisation depends on the specific research project and the server where the data is stored. Here we list a few guidelines and tips to help you decide.

General considerations

  1. Familiarise yourself with the storage system. Make sure you are storing the files in the most appropriate place, and find out whether the storage is backed up, what your allocation is, and what rules or best practices apply.
  2. Take into account how you or others might want to use the data. This is particularly important when deciding the DRS, but also when deciding how to divide data across files for big datasets such as model output. Doing so at the start of the project will spare you a lot of time you might otherwise spend re-processing all your files.
  3. Be consistent. This applies to both the organisation and the naming: consistency is essential for the data to be machine-readable, i.e. easy to access from code. Use community standards and/or controlled vocabularies wherever possible.
  4. Consider adding a readme file in the main directory (we always do that for data we publish), including an explanation of the DRS and of the naming conventions, abbreviations and/or codes you used. If you used standards and controlled vocabularies, all you have to do is include a link to them.

Naming

You can use your filenames to convey information. Here are some elements you can consider:

  • project, simulation and/or experiment acronyms: you might have to use a combination of them
  • spatial coverage: the region or coordinate range covered by the data; for climate model data this could also be a specific domain, e.g. ocean, land etc.
  • grid: either a grid label or the spatial resolution
  • temporal coverage: a specific year/date or a temporal range
  • temporal frequency: monthly, daily etc.
  • type of data: this depends on context. If the same directory contains data from different instruments, it is important to specify the instrument in the name. For coupled model output this could be the model component or, if you are using one file per variable, the variable name.
  • version: this is really important if you are sharing the data, even if only one version exists at the time
  • correct file extension
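
As a sketch of how these elements can be combined, the snippet below builds a file name from its components and parses it back. The field order, the underscore separator and the example values (`myproj`, `tas`, `aus`, `r025`, `mon`) are assumptions made for illustration only; they are not a convention prescribed by this page. Where a community standard exists, prefer it.

```python
from typing import NamedTuple


class NameParts(NamedTuple):
    """Hypothetical set of fields encoded in a file name."""
    project: str
    variable: str
    region: str
    grid: str
    frequency: str
    start: str    # first date covered, as YYYYMMDD
    end: str      # last date covered, as YYYYMMDD
    version: str


def build_name(parts: NameParts, ext: str = "nc") -> str:
    """Join the fields with underscores; the two dates form one range field."""
    fields = [parts.project, parts.variable, parts.region, parts.grid,
              parts.frequency, f"{parts.start}-{parts.end}", parts.version]
    return "_".join(fields) + "." + ext


def parse_name(filename: str) -> NameParts:
    """Recover the fields from a name produced by build_name."""
    stem, _ext = filename.rsplit(".", 1)
    project, variable, region, grid, frequency, dates, version = stem.split("_")
    start, end = dates.split("-")
    return NameParts(project, variable, region, grid, frequency, start, end, version)


example = NameParts("myproj", "tas", "aus", "r025", "mon",
                    "19900101", "19991231", "v1")
name = build_name(example)
print(name)
print(parse_name(name) == example)
```

Parsing only works because the separator never appears inside a field: with this scheme, individual fields must not contain underscores, and the dates must not contain dashes.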

DRS

[Figure: Example of directory structure]

The figure above shows an example of an organised working directory for model output. Things to consider:

  • try to organise files in directories based on their type and how you process them
  • for the final output, think also about how others might use the files: are they going to be used for analysis, or could they be used as forcing or restart files for a model?
  • think of the way you would access these directories from code; for example, name the variable directories exactly after the variables they contain
  • make sure your code is kept separate from your data; you want to be able to use something like git to version control it and possibly GitHub to back it up easily
  • have at least one readme file with detailed metadata, possibly more if you have a lot of directories/files. You cannot use git to manage versions of data, but you can use it to version control your readme files.
  • review at regular intervals what you are keeping, what needs to be removed and how things are organised
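
To illustrate the point about accessing directories from code, here is a minimal sketch in which the directory for each variable has exactly the same name as the variable. The layout (`<experiment>/output/<frequency>/<variable>/`) and the names `exp01`, `mon`, `tas` and `pr` are hypothetical and are not taken from the figure. The example creates empty files in a temporary directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path


def variable_dir(root: Path, experiment: str, frequency: str, variable: str) -> Path:
    """Build the directory path for one variable (assumed layout)."""
    return root / experiment / "output" / frequency / variable


def find_files(root: Path, experiment: str, frequency: str, variable: str) -> list:
    """List the files for one variable; the directory name doubles as the file prefix."""
    directory = variable_dir(root, experiment, frequency, variable)
    return sorted(directory.glob(f"{variable}_*.nc"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a small mock tree with two variables and two yearly files each.
    for var in ("tas", "pr"):
        directory = variable_dir(root, "exp01", "mon", var)
        directory.mkdir(parents=True)
        for year in (1990, 1991):
            (directory / f"{var}_mon_{year}.nc").touch()
    found = [p.name for p in find_files(root, "exp01", "mon", "tas")]

print(found)
```

Because the variable name appears unchanged in both the directory and the file name, a single loop over variable names is enough to reach every file without a lookup table.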


Tips for machine-readable files

  • avoid special characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “
  • do not use spaces to separate words; use underscores "_", dashes "-" or CamelCase
  • use YYYYMMDD for dates, as it will sort your files in chronological order; absolutely avoid "Jan, Feb, ..." for months, as they are much harder to handle in code
  • for number sequences use leading zeros: 001, 002, 020, 103 rather than 1, 2, ... 20, ... 103
  • try to avoid overly long names: keep a single file or directory name under 255 characters and a full path under 30000
  • avoid having a large number of files in a single directory, but also an excessive number of directories with one file each
  • always include the file extension; some software can recognise files from their header, but this is not always the case
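
Several of these tips can be sketched in a few lines of Python. The helper names below (`safe_name`, `date_stamp`, `numbered`) are hypothetical, and the set of characters kept by `safe_name` (letters, digits, underscore, dash and dot) is an assumption chosen to be conservative rather than a definitive rule.

```python
import re
from datetime import date


def safe_name(text: str) -> str:
    """Replace whitespace with underscores, then drop every character
    that is not a letter, digit, underscore, dash or dot."""
    text = re.sub(r"\s+", "_", text.strip())
    return re.sub(r"[^A-Za-z0-9_.-]", "", text)


def date_stamp(d: date) -> str:
    """Format a date as YYYYMMDD so that names sort chronologically."""
    return d.strftime("%Y%m%d")


def numbered(prefix: str, index: int, width: int = 3, ext: str = "nc") -> str:
    """Zero-pad a sequence number to a fixed width."""
    return f"{prefix}_{index:0{width}d}.{ext}"


clean = safe_name("my run (final)?.nc")
stamp = date_stamp(date(2021, 9, 12))

# Compare alphabetical sorting with and without leading zeros.
indices = (1, 2, 20, 103)
unpadded = sorted(f"run_{i}.nc" for i in indices)
padded = sorted(numbered("run", i) for i in indices)

print(clean)
print(stamp)
print(unpadded)
print(padded)
```

Without leading zeros, alphabetical sorting places `run_103.nc` before `run_2.nc`; with them, alphabetical and numerical order coincide.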


Online Resources

We partially based this page on the resources listed below; we recommend checking them for more insight and advice.