Difference between revisions of "Data FAQ"


Revision as of 17:03, 12 December 2019

Where should I publish my data?

You usually have at least three options, and there is no single answer: it depends on what you are publishing and why.

CLEx data collection on NCI: this is the best place if you have produced the data yourself. We will help you document your data and make it user friendly. It will be part of a climate data collection, so it will be easier to discover. NCI also has more storage capacity than other repositories, and its services are designed around the netCDF format.

Institutional repository: this can be adequate if you have a small dataset that is specific to your study, or that is only a subset or post-processing of another dataset. While institutions offer some data curation, they usually won't check that the data is well described, consistent and user friendly, so you will get a DOI but no added value.

Zenodo: it is tempting to use Zenodo and similar services, because you can publish your data very quickly without needing approval from anyone. The downside is that, while you get a DOI, your data is not part of an actual repository: no one has checked that it is sufficiently described, or even that the attributes offered by the service are properly used. The data size is limited and you won't get any additional data services apart from HTTP download. This should be your last resort. If you do go this way, please make sure you document your data properly.

If you're unsure, feel free to ask us; we are always happy to provide advice and support. Whatever you choose to do, remember to report the title and DOI on Clever.

Why does the CMIP5 data on Raijin have such a complicated directory structure?

The CMIP5 dataset is a PB-sized dataset contributed by as many as 60 different modelling groups, so it is inherently complicated to organise. It is also complicated because the modelling groups didn't stick to the rules and did their own thing. Unfortunately, the guidelines on versioning the dataset were not sufficiently detailed, so they have been interpreted differently by different groups.

When the climate community started downloading data on Raijin, it was decided that the only way to keep track of the dataset "version" was to re-create the DRS (Data Reference Syntax) paths as they appear on the web servers (THREDDS), since these are unique. We also always download the originally published dataset and never replicas from other nodes. When a web server reaches its capacity, new datasets are published from a new server, which means that you could have a different root for the same model, sometimes even for the same experiment.

Currently NCI is re-downloading the latest versions of CMIP5 non-Australian data into a more coherent directory structure. The new replicated data is stored in the al33 group. Refer to their climate community page (https://opus.nci.org.au/display/CMIP/CMIP+Community+Home) for information and updates.
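As a rough illustration of why the version has to be read from the directory path, the sketch below splits the tail of a DRS-style path into named components. It assumes the tail follows the ordering in the published CMIP5 DRS document (institute, model, experiment, frequency, realm, MIP table, ensemble member, version, variable); the actual layout on Raijin may differ, and the example root and version string are hypothetical.

```python
from pathlib import PurePosixPath

# Field order assumed from the published CMIP5 DRS document.
# The root of the path is ignored because it differs between servers.
DRS_FIELDS = (
    "institute", "model", "experiment", "frequency", "realm",
    "mip_table", "ensemble", "version", "variable",
)


def parse_drs_tail(path):
    """Map the last nine directories of a DRS-style path to field names."""
    parts = PurePosixPath(path).parts
    if len(parts) < len(DRS_FIELDS):
        raise ValueError(f"path too short to contain a DRS tail: {path}")
    return dict(zip(DRS_FIELDS, parts[-len(DRS_FIELDS):]))


# Hypothetical path: the root and the version string are made up.
example = (
    "/hypothetical/root/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/"
    "r1i1p1/v20120727/tas"
)
fields = parse_drs_tail(example)
print(fields["model"], fields["ensemble"], fields["version"])
```

Reading the components from the end of the path rather than the start means the same function works even when the same model appears under different roots.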

Do CMIP5 variables coming from the same simulations have the same version number?

This question is difficult to answer and it is a really sore point with CMIP5. Hopefully, the changes being implemented will stop it from happening again in CMIP6.

There is no way to be 100% sure that two versions come out of the same simulation. The versioning instructions were quite unclear and were interpreted differently by different modelling groups. In the last couple of years a few groups (for example GFDL) have started adding a "simulation_id" to their attributes, but this is the exception rather than the norm. I'd like to assume that the same version means the same simulation, but a different version really just means that the group of variables was published later, or perhaps that one of them was calculated wrongly in the post-processing and has been re-published under a new version.

A completely different simulation, with a different configuration, initialization etc., should have a different ensemble code. So anything with r1i1p1, for example, should come from the same run, even if part of it or its post-processing might have been updated. More information might be available directly from the modelling groups.

As a further precaution, if you find that a few of the variables you need have a more recent version, you can send an e-mail to climate_help (climate_help@nci.org.au) and we will check whether everything you need is up to date. Users request only what they need, and it is quite possible that someone updated just part of an ensemble.
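The check described above can be sketched in a few lines. The example below parses an ensemble code such as r1i1p1 into its realization, initialization and physics numbers, then groups a list of variables by ensemble code and version to flag a mix of versions. The function names, the record layout and the version strings are all hypothetical; this is an illustration, not a tool provided by NCI or CMS.

```python
import re
from collections import defaultdict

# CMIP5 ensemble codes have the form r<N>i<M>p<L>.
ENSEMBLE_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)$")


def parse_ensemble(code):
    """Split an ensemble code into its three integer components."""
    match = ENSEMBLE_RE.match(code)
    if match is None:
        raise ValueError(f"not a valid ensemble code: {code}")
    realization, initialization, physics = map(int, match.groups())
    return {
        "realization": realization,
        "initialization": initialization,
        "physics": physics,
    }


def group_by_version(records):
    """Group variable names by (ensemble, version).

    `records` is an iterable of (variable, ensemble, version) tuples.
    """
    groups = defaultdict(set)
    for variable, ensemble, version in records:
        groups[(ensemble, version)].add(variable)
    return dict(groups)


# Hypothetical records: the version strings are made up.
records = [
    ("tas", "r1i1p1", "v20120101"),
    ("pr", "r1i1p1", "v20120101"),
    ("psl", "r1i1p1", "v20130315"),
]
groups = group_by_version(records)
versions = {version for (_, version) in groups}
mixed = len(versions) > 1  # True when one ensemble member spans several versions
print(parse_ensemble("r1i1p1"), mixed)
```

A mixed result does not prove that the variables come from different simulations; as noted above, it usually only means that part of the ensemble was published or updated later.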

Using the CLEx Roadmap data management tool is not useful for me as I don't use NCI servers.

Using Roadmap is really independent of NCI. While it contains some specific information on NCI systems, since they are common to many in the Centre, it is more about data management in general, regardless of which hardware you use. Even if you are using your laptop, it is good to have a data workflow: a plan of what you will be doing at different stages of your research. For example, to publish an RDA record for your data with us you would now fill in one of these plans, and the advantage is that you can easily export it as a document which you can always adapt and re-use.

A data management plan (DMP) will be compulsory for universities and ARC grants, and publishing your data is already compulsory for most journals. Plus, at CMS we really want to hear from users who are not using NCI, users we don't normally hear from. That way we get a better idea of what everybody in the Centre is doing, and we can create new training resources or support all our users in a better way.