Data induction

Revision as of 00:51, 5 December 2019 by C.carouge (talk | contribs)

This is an outline of the student data induction, but it can be useful to anyone who wants to get an overview of data management in our center.

We will cover three questions around data management:

  • Why data management?
  • Which data does it apply to?
  • How to manage your data: guidelines and tools.

Why should I have a management strategy for my data output?

There are several reasons why it is a good idea to have a data management strategy in place for your PhD or research project. The main one is that it is in your own best interest, no matter how little of your data you end up publishing: it could be only the metadata (i.e. the information about the data) or a subset, and anything is better than nothing. It will make your research more visible, and published studies show that papers accompanied by published data are more likely to be cited. You might be required to do so by your institution, and most publishers want the data to be published when you submit a paper. Most importantly, it simply helps you understand what you are trying to achieve and how.

To review some of these reasons in more detail, see our why should I manage my data section.

Which data does this apply to?

It is difficult to give a straight answer to this question because it depends on what you are trying to achieve. One of your main goals will be to share and publish your data, so it is good to consider "which data" with this in mind. Ideally you should share all your data, but this is often impractical and in some cases not even useful. There are also different opinions on what data output means: should you include data you used as input, or only your own original products? Should you also share the code you used to produce your data? Finally, there are issues with size: it is easy to share a dataset of a few GB, but a challenge to share model output measured in TB.

While there is no ready-made answer, there are guidelines and principles to help you formulate your own. A few are listed in our which data should I publish section.

How to manage your data: guidelines and tools


There are a lot of tools currently available to manage and publish your data. Which one you choose will depend on your project, your research group, the server you are using for your analysis, etc. Some tools, though, are widespread in the climate community or have been adopted by CLEx or your own institution. In addition, no matter which versioning tool you choose or which catalogue you use to publish your metadata and/or data, the underlying principles are the same. We review some of them in our data management tools section. All these tools have a Data Management Plan as either a goal or a starting point.
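
To make the idea of metadata (the information about the data) a little more concrete, here is a minimal Python sketch of a metadata record for a dataset. It is only an illustration: the function name `build_metadata`, the field names and the example values are all hypothetical and do not come from any particular catalogue or from CLEx tools. A real catalogue will define its own required fields.

```python
import json


def build_metadata(title, creator, description, size_bytes, license_name):
    """Return a plain dict describing a dataset.

    All field names here are hypothetical, chosen for illustration only.
    """
    if not title or not creator:
        raise ValueError("title and creator are required")
    return {
        "title": title,
        "creator": creator,
        "description": description,
        # Store the size in GB so that readers can judge how easy it is to share.
        "size_gb": round(size_bytes / 1e9, 2),
        "license": license_name,
    }


# Example values are placeholders, not a real dataset.
record = build_metadata(
    title="Example model output",
    creator="A. Student",
    description="Placeholder description of a dataset.",
    size_bytes=3_500_000_000,
    license_name="CC BY 4.0",
)

# Serialise to JSON so the record can sit next to the data files.
print(json.dumps(record, indent=2))
```

Even a small record like this, kept alongside the data, covers the "anything is better than nothing" case mentioned earlier and gives you a starting point when you later fill in a catalogue entry.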

Once you are familiar with these concepts and ready to publish your data, you can refer to these data publishing guidelines to see how to actually publish your data using the NCI services.