Controlled vocabularies

 

Latest revision as of 02:13, 30 July 2021

A controlled vocabulary is an agreed list of terms and definitions used to give each concept a unique label. Controlled vocabularies are usually discipline-specific; their main aim is to facilitate data sharing within a community. For this reason, a vocabulary is only useful if the community participates in its development and agrees to adopt it.

In some cases, vocabularies have been created for a specific project and then more widely adopted. For example, since CMIP is an intercomparison project with modelling groups participating from across the world, defining and using controlled vocabularies was essential to its success. The CMIP6 controlled vocabularies cover many different aspects: experiments, variables, realms, models, sub-projects, frequency, resolution and grid labels. Their definitions and labels for variables, frequency and realms are often adopted by other climate data producers.
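
As a rough illustration of how a controlled vocabulary is used in practice, the sketch below checks metadata values against a list of allowed labels. The label sets are only a small hand-copied excerpt of CMIP6-style frequency and realm labels, and the `check_labels` function and the `metadata` dictionary are hypothetical names introduced for this example; the complete, authoritative lists are in the CMIP6 controlled vocabularies repository.

```python
# Small excerpts of CMIP6-style labels, hard-coded for illustration only.
# The authoritative, complete lists live in the CMIP6_CVs repository.
CONTROLLED_VOCAB = {
    "frequency": {"fx", "yr", "mon", "day", "6hr", "3hr", "1hr"},
    "realm": {"atmos", "ocean", "land", "seaIce", "landIce"},
}


def check_labels(attributes, vocab=CONTROLLED_VOCAB):
    """Return {attribute: value} for every value that is not in the vocabulary."""
    invalid = {}
    for name, value in attributes.items():
        allowed = vocab.get(name)
        # Attributes that have no vocabulary (e.g. a free-text title) are skipped.
        if allowed is not None and value not in allowed:
            invalid[name] = value
    return invalid


# Hypothetical metadata: "monthly" is a natural word, but not the agreed label "mon".
metadata = {"frequency": "monthly", "realm": "ocean", "title": "Example dataset"}
print(check_labels(metadata))
```

Here the check reports `{'frequency': 'monthly'}`: the value is understandable to a human, but it is not the agreed label, which is exactly the kind of inconsistency a controlled vocabulary is meant to prevent.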

Another example of a controlled vocabulary is the CF conventions standard_name table; anyone can contribute by proposing a definition for a variable that is not yet covered.

Keywords

Controlled vocabularies also provide keywords to use when publishing data. Keywords are a powerful instrument when used properly. They can greatly increase the discoverability of a dataset, which is why the keywords attribute is one of the few highly recommended attributes in the ACDD conventions. Unfortunately, there is not yet an agreed controlled vocabulary specifically for climate science. Many climate terms are, however, covered by the Global Change Master Directory (GCMD) Keywords, maintained by NASA.

There are a few categories you should try to cover when assigning keywords:

  • dataset acronym: if your data is closely related to another dataset, or your code is applied to a specific dataset
  • model acronym and version: as for datasets, if you generated the data using a model
  • project acronym: if your dataset and/or code relates to a specific project
  • programming language: add this to your code records and be as specific as possible, for example use python3 rather than just python
  • data type: observation, model output, etc.
  • realm or discipline: for example ocean or land, and/or physical oceanography, climate science, etc. For disciplines you can use the Fields of Research codes from the Australian Bureau of Statistics
  • variable names: if you have many, list only the most relevant
  • spatiotemporal characteristics of the data: frequency, resolution, region covered

Whenever you define a keyword, favour terms provided in a vocabulary; the GCMD keywords, for example, cover most of the categories listed above. If you are using a specific name, as for datasets, models and projects, use the official acronym and specify the version whenever possible.
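
The categories above can be assembled into a single keywords value, as in the minimal sketch below. It assumes the keywords are stored as one comma-separated string, which is how the ACDD conventions describe the keywords attribute. The `build_keywords` helper and all the example values are hypothetical; in particular, `EXAMPLE-MODEL-1.0` is a placeholder and not a real model acronym.

```python
def build_keywords(**categories):
    """Join keyword categories into one comma-separated string.

    Each category may be a single string or a list of strings.
    Blank entries and exact duplicates are dropped; order is preserved.
    """
    seen = []
    for values in categories.values():
        if isinstance(values, str):
            values = [values]
        for value in values:
            value = value.strip()
            if value and value not in seen:
                seen.append(value)
    return ", ".join(seen)


# Hypothetical record: every value here is a placeholder for illustration.
keywords = build_keywords(
    project="CMIP6",
    model="EXAMPLE-MODEL-1.0",
    language="python3",
    data_type="model output",
    realm="ocean",
    variables=["sea_surface_temperature"],
    spatiotemporal=["mon", "global"],
)
print(keywords)
```

This prints `CMIP6, EXAMPLE-MODEL-1.0, python3, model output, ocean, sea_surface_temperature, mon, global`. Keeping the categories as named arguments makes it easy to see at a glance which ones a record does not yet cover.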

Also remember that if a portal has a free-text search, any word in your title will also act as a keyword, which is why it is useful to have a descriptive title for your dataset or code.

Research Vocabularies Australia

ARDC manages a controlled vocabulary service, Research Vocabularies Australia (RVA), which lists vocabularies used by the Australian research community. As well as making it easier to find controlled vocabularies, RVA also allows research organisations to contribute and publish new vocabularies.