Why should I care?

Data Publication in CLEX: A Quick Guide for University Researchers

by Ian Macadam

Why should we publish data?

This is an obvious question for any researcher of the current and previous generation to ask. However, I suggest that a better question would be “Why aren’t we all routinely publishing our data?” By the time CLEX “closes its doors” in 2024, I expect (and hope) that an academic asking “Why should we publish data?” will be greeted with the same number of raised eyebrows as one asking “Why should we publish papers?”

I once heard a professor say “research that has not been published does not exist”. He is, of course, wrong – there is plenty of unpublished research undertaken by governments, private corporations and universities (and in many cases there are some extremely good reasons why this research is not published!). However, the quote is a good one in the context of CLEX academics. In this context, we publish our research to let the world know it exists, thus gaining credit for our work and allowing others to reproduce, test and build on it. These objectives are partially served by publishing scientific papers. However, our research ≠ our papers. An ever-greater portion of our research takes the form of data and computer code that, in most cases, cannot easily be reproduced just by reading the relevant paper (e.g. it’s kind of a nuisance to recode and rerun all the CMIP6 models from scratch on the basis of a 4-page Nature paper to see if the authors’ conclusions about ice albedo effects stack up). More rigorous processes for publication are therefore being increasingly applied to data and software.

What is published data?

So what is published data? A useful definition is data that has been assigned a Digital Object Identifier (DOI), where the DOI is visible, together with at least a brief description of the dataset, in at least one relevant online repository. Meeting this definition does not necessarily mean that a dataset adheres to the FAIR Data Principles, but it will be increasingly difficult to claim that a dataset is FAIR if it does not meet this definition.
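To make the definition concrete, here is a minimal Python sketch that checks a dataset record against its three parts: a DOI, a description and at least one repository listing. The `DatasetRecord` class, its field names and the example DOI are all hypothetical, invented for illustration. The pattern only tests that a string looks like a DOI (a prefix beginning with "10." followed by a slash and a suffix); it cannot tell you whether the DOI has actually been registered.

```python
import re
from dataclasses import dataclass, field

# Loose syntactic check only (an assumption, not an official validator):
# a DOI name is a prefix starting with "10.", then "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}(\.\d+)*/\S+$")


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of a dataset; field names are illustrative."""
    title: str
    doi: str = ""
    description: str = ""
    repositories: list[str] = field(default_factory=list)


def is_published(record: DatasetRecord) -> bool:
    """Return True if the record meets the working definition of published data."""
    has_doi = bool(DOI_PATTERN.match(record.doi.strip()))
    has_description = bool(record.description.strip())
    is_listed = len(record.repositories) > 0
    return has_doi and has_description and is_listed


# Example with made-up values
record = DatasetRecord(
    title="Example rainfall dataset",
    doi="10.1234/example-dataset",           # hypothetical DOI
    description="Daily rainfall totals for an example region.",
    repositories=["example-repository"],     # hypothetical repository name
)
print(is_published(record))
```

A record with a DOI but no repository listing fails the check, which is the point made in the next section about DOIs that nobody can find.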

Why should I publish data?

Good practice in science is rapidly evolving around data. Today, it would be bizarre, to say the least, not to have a DOI assigned to a scientific paper. Publishers assign DOIs to papers so that they are easy to reference and so that computer systems can easily track citations. It is rapidly becoming the norm for DOIs to be assigned to datasets. In the future, datasets that researchers produce will be less likely to be used if they do not have DOIs and those researchers who do have DOIs assigned to their datasets will receive extra credit through citations of their data. On a more basic level, it may not be possible to publish a paper in some journals if the data underlying the analysis is not available to readers.

Of course, having a DOI assigned to a dataset is no use if the DOI and a description of the dataset are not visible to those with a potential interest in the data, so an important part of data publication is that potential users can find the data in an appropriate online repository. The analogy here is assigning a DOI to a scientific paper but then not having it included in a journal!

When should I NOT publish my data?

It is not appropriate to publish all data. There are a number of cases where one would not want to publish a dataset:

  1. The data are “intermediate” or “working” data, produced as a step towards your final results. Such data are not critical to the reproducibility of the final results and are often subject to change as you work on your methods.
  2. Publishing the data now would allow other scientists to analyse it and publish key conclusions before you have completed your own analysis.
  3. You have derived the data using underlying data or methods that do not allow you to publish the data.
  4. Publishing the data would prejudice a patent application or give away IP that has commercial value.
  5. The data contains sensitive personal information (e.g. health records for individuals).

Of course, many of these reasons depend on timing. It may not be a good idea to publish your data this summer, before you have completed your analysis or applied for a patent, but fine to do so next summer after your paper is published and your patent application has been submitted.
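The checklist above, together with the point about timing, can be sketched as a small Python helper. This is only an illustration: the `DatasetStatus` class, its flag names and the three possible answers are hypothetical, and treating reasons 2 and 4 as the time-dependent ones is my reading of the paragraph above rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class DatasetStatus:
    """Hypothetical flags, one per reason in the numbered list above."""
    intermediate_only: bool = False         # 1. working data, not needed for reproducibility
    own_analysis_pending: bool = False      # 2. others could publish key conclusions first
    licence_forbids_release: bool = False   # 3. underlying data or methods do not allow it
    patent_or_ip_at_risk: bool = False      # 4. would prejudice a patent or give away IP
    contains_personal_info: bool = False    # 5. sensitive personal information

# Assumption: these reasons may lapse with time (paper published, patent submitted).
TIME_DEPENDENT = {"own_analysis_pending", "patent_or_ip_at_risk"}


def reasons_to_hold(status: DatasetStatus) -> list[str]:
    """List the names of the flags that are currently set."""
    return [name for name, flagged in vars(status).items() if flagged]


def advice(status: DatasetStatus) -> str:
    """Return 'publish', 'revisit later' or 'do not publish'."""
    blockers = reasons_to_hold(status)
    if not blockers:
        return "publish"
    if set(blockers) <= TIME_DEPENDENT:
        return "revisit later"
    return "do not publish"


# This summer: analysis still in progress
print(advice(DatasetStatus(own_analysis_pending=True)))
# Next summer: paper published, nothing else flagged
print(advice(DatasetStatus()))
```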