Data management tools


There are a lot of data services, and new ones appear each year for different disciplines and use cases; I'm listing here a few that we currently use. Please note that I'm also including tools that are useful for keeping track of your data workflow, and that code and software are included in our definition of data.

ANDS and Research Data Australia

The Australian National Data Service (ANDS) was funded by the government in 2010 with the task of facilitating the management and sharing of data in Australia. They collaborate with most Australian institutions producing data, including CLEx, of which they are a partner. While they are involved in a wide range of projects, all revolving around data and its management, their main achievement is probably the Research Data Australia (RDA) metadata catalogue. Most universities and research centers across Australia now list their data collections on RDA. Listing there is becoming a requirement, and it is a useful tool for finding out what data exists. It is also the repository where we list our datasets, so they have an online reference when we publish papers, whether or not they have been assigned a DOI. ARCCSS and CLEx have their own data source, and so do NCI and most universities.

Datasets listed by RDA are automatically added to the recently released Google Dataset Search tool.

Geonetwork

NCI provides some data services too. They require a data management plan (DMP) for each dataset and data collection hosted on raijin. The DMPs can be added and/or updated online; they are then used to create metadata records, which are displayed by the NCI geonetwork catalogue. Geonetwork, like RDA, is a metadata catalogue and can be used to find which datasets are available at NCI, both on the file system for internal use and online. Once a dataset has a geonetwork record, NCI can mint a DOI for it upon request. Both ARCCSS and CLEx have a collection of datasets on geonetwork, which we use to publish any data output from our students and/or researchers that is not already part of a data collection. Ultimately there will be a geonetwork record, an RDA record and a DOI for each one of these.

THREDDS 

The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using a variety of remote data access protocols. A THREDDS server appears as a webpage listing the datasets as folders and files. Once you get to file level, each file has a page which lists all the services that have been enabled to access the data.

As well as downloading the file via HTTP, you can extract a subset and save it as a netCDF file (NetCDF Subset Service), or use GIS services (WMS, WCS, WFS), some visualisation tools and OPeNDAP.
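As a rough illustration of the NetCDF Subset Service, the request is just a URL with the variable, bounding box and time range passed as query parameters. The sketch below only builds such a URL; the server name, dataset path and variable are made up for the example, and the exact endpoint layout can differ between THREDDS versions, so check the link shown on the file's own page before relying on it.

```python
from urllib.parse import urlencode


def ncss_url(server, dataset_path, variable, bbox, time_range):
    """Build a NetCDF Subset Service request URL for one variable.

    bbox is (north, south, east, west) in degrees;
    time_range is (start, end) as ISO 8601 strings.
    """
    north, south, east, west = bbox
    params = {
        "var": variable,
        "north": north,
        "south": south,
        "east": east,
        "west": west,
        "time_start": time_range[0],
        "time_end": time_range[1],
        "accept": "netcdf",  # ask for the subset as a netCDF file
    }
    # Assumed path layout: <server>/thredds/ncss/<dataset path>
    return f"{server}/thredds/ncss/{dataset_path}?{urlencode(params)}"


# Hypothetical server, file and variable, for illustration only.
url = ncss_url(
    "https://thredds.example.org",
    "demo/tas_monthly.nc",
    "tas",
    bbox=(-10.0, -45.0, 155.0, 110.0),
    time_range=("2000-01-01T00:00:00Z", "2000-12-31T00:00:00Z"),
)
print(url)
```

Pasting a URL of this kind into a browser, or passing it to a download tool, should return only the requested region and period rather than the whole file.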

OPeNDAP is a widely used, subsetting data access method extending the HTTP protocol. OPeNDAP can be used with many analysis packages to access remote data as you would access a local netCDF file. We have a blog post that demonstrates how to build and use an OPeNDAP URL; more in-depth information on OPeNDAP is available from their website, including a list of software that understands this protocol.
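To give an idea of what an OPeNDAP URL looks like, here is a minimal sketch that assembles one. It assumes the usual THREDDS layout, where the OPeNDAP service sits under "dodsC", and it uses a made-up server, file and variable. The optional part after the "?" is a constraint expression: one [start:stride:stop] index range per dimension, with the stop index included.

```python
def opendap_url(server, dataset_path, variable=None, index_ranges=()):
    """Build an OPeNDAP URL, optionally constrained to part of one variable.

    index_ranges holds one (start, stride, stop) tuple per dimension;
    indices are zero-based and the stop index is inclusive.
    """
    # Assumed path layout: <server>/thredds/dodsC/<dataset path>
    base = f"{server}/thredds/dodsC/{dataset_path}"
    if variable is None:
        return base
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in index_ranges
    )
    return f"{base}?{variable}{hyperslab}"


# Hypothetical server and file: first 12 time steps of a small lat/lon box.
full = opendap_url("https://thredds.example.org", "demo/tas.nc")
subset = opendap_url(
    "https://thredds.example.org",
    "demo/tas.nc",
    variable="tas",
    index_ranges=[(0, 1, 11), (10, 1, 20), (30, 1, 40)],
)
print(full)
print(subset)
```

In practice you often only need the unconstrained URL: many netCDF-aware tools accept it in place of a file name and request just the slices you ask for, provided they were built with OPeNDAP support.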

NCI uses a THREDDS server to make datasets available remotely. Our CLEx and ARCCSS collections are also available on it.

Digital object (DOI) and researcher identifiers

An identifier is any label used to name something uniquely (whether online or offline). URLs are an example of an identifier. So are serial numbers and personal names. A persistent identifier is guaranteed to be managed and kept up to date over a defined time period. You can now create persistent identifiers for data as you do for papers. NCI can mint a DOI for you if your data is hosted on their servers; we can facilitate this process by adding your dataset to one of our collections. In some cases your research output might be part of an international or bigger project (such as CMIP5) which has its own arrangement for DOI minting. More information is provided by ANDS, which has put together some identifier guidelines.

A more recent development is researcher identifiers: a unique identifier which you can add to anything that defines you as a researcher, such as papers, blogs, published datasets, web pages and code repositories. Again, ANDS has some guidelines, particularly on the Open Researcher and Contributor ID (ORCID), which is the one they endorse. It is important to point out that the ARC now allows the use of ORCIDs or other researcher identifiers in grant applications. This is particularly important for all the students and researchers who spend a long time producing data (for example by running a model) and creating their own scripts for data analysis, because it allows them to show a more accurate and complete picture of their achievements than a simple publications list.
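If you store ORCIDs alongside your datasets or code, it can be handy to catch typos early. An ORCID is 16 characters, usually written in four hyphenated groups, and its last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The function names below are my own, and this is only a format check: it says nothing about whether the ORCID is actually registered to anyone.

```python
import re


def orcid_check_digit(base_digits):
    """Return the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the shape and check digit of an ORCID such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the sample identifier used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```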

University services

All the universities that are part of CLEx have either adopted a data repository or are working towards one. Depending on their rules, as a student you might have to produce your own data management plan or use one common to your whole research group. In most cases the final aim of these is again to get a record on RDA.

Publishing data guidelines from papers

We have a section of this wiki on publishing data for a journal paper: publishing guidelines | Principles of data sharing according to Nature publishers | Elsevier data publication


DMPonline tool

There are online tools available to create a DMP; we are using one developed in the UK called DMPonline. We implemented our own version, ARCCSS DMPonline, to adapt it to ARCCSS/CLEx needs.

Our DMP template, for example, includes information specific to creating an RDA record and working on NCI systems, and asks which data and software you would like us to support or provide training for. We also included a new functionality which allows you to request extra disk storage on raijin, if the storage provided by your project is insufficient or to share data more easily with your collaborators. You can use the tool in many ways: to keep track of your work, to share a DMP with your supervisors or collaborators, or to write a DMP to attach to a grant proposal. Most importantly, you can use it to familiarise yourself with data management practices currently used in Australia, and in the climate community in particular. More information is available on the data publishing guidelines page. Why you should care about this even if you don't work with NCI resources is answered in the data FAQs.

Workflows, standards, git, svn, bitbucket and software repositories

These are all tools that will help you manage your actual data files (and I'm including any script and code in this definition) in a more secure, reliable way. What they have in common is that they use versions and keep track of changes, so they help you and potential users of your data to know for certain how the data has been created, processed and changed from the input to its latest version.

Some of these tools are embedded in the models, GUIs, software, web tools and input data that you might be using. However, there will be parts of your workflow (i.e. the succession of steps you take to get from your input to your final product) that are fully in your hands, and it is often up to you to make sure that the information automatically generated by one step gets linked to the next one. Recording all this information is known as provenance, and it is probably the biggest challenge in making science reproducible.

There have been several attempts and approaches to produce a provenance tool, but there is currently no foolproof, easy tool that you can apply to every case. On the bright side, there are a few tools which are already available to you. For example, the "rose suite" used to run the ACCESS model will keep track of which configuration you're using and save it to a database, where it is available for any other user to retrieve.
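For the parts of a workflow that no tool covers, even a small hand-written record helps. The sketch below is not any of the tools mentioned here, just a minimal illustration using the Python standard library: for each processing step it saves a checksum of every input and output file, the parameters used and the time of the run to a JSON file that can be kept next to the data. The function and field names are my own choice.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large data files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(step, inputs, outputs, parameters=None):
    """Describe one workflow step: what went in, what came out, and how."""
    return {
        "step": step,
        "created": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "command": sys.argv,
        "parameters": parameters or {},
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }


def write_provenance(record, path):
    """Save the record as JSON alongside the data it describes."""
    Path(path).write_text(json.dumps(record, indent=2))
```

Because the output checksums of one step should match the input checksums of the next, a chain of such records lets you verify later that the steps really were linked the way you remember.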

An example of a workflow is the CWSlab pipeline developed by CSIRO. It was developed to make it easier to perform a series of tasks on CMIP datasets in a structured and reproducible way. As well as simplifying access to the data, the pipeline saves a lot of information on which script you are using, which version of any software, and any changes you made to the workflow itself.

VisTrails is a workflow tool freely available and already installed on the NCI remote desktop (VDI). It was first installed to provide a GUI for the pipeline, but can be used independently.

Svn and git are used to keep track of the evolution of code, including models. Through online services like GitHub and Bitbucket, they also allow you to keep track of changes in collaborative projects, where it would otherwise be really hard to keep track of different versions.
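One simple way to connect version control to your data is to note which commit of your code produced a given output. The sketch below assumes git is installed and that the script lives inside a git repository; if either is not the case it falls back to "unknown" rather than failing. The function name is hypothetical.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the hash of the commit currently checked out, or "unknown"."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository
        return "unknown"
    return result.stdout.strip()


# For example, build a line to store in an output file's metadata.
history = f"created with analysis code at commit {current_commit()}"
print(history)
```

Storing a line like this in the output file, for instance in a global attribute of a netCDF file, means anyone holding the data can find the exact version of the script that made it, provided your changes were committed before the run.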

While adopting provenance principles might feel like an overwhelming task, any improvement is better than nothing; it is more about forming good habits so that these tasks become second nature. You can start, for example, by using git to keep track of your own code and sharing it with other users when you feel comfortable doing so. It will gently force you into the habit of versioning, backing up and documenting your own scripts. You can start to use git (or svn if you prefer) on any server and look at linking your local repository to a shared GitHub or Bitbucket repository later on. For more on svn, git etc., check the center induction page. Available software repositories, which collect scripts you can use for your analysis and are open to contributions, are listed below: