5 Metadata

Data is rarely informative on its own – it may be a page in a book, a file in an obsolete file format on a governmental server, or an Excel sheet that you do not remember having checked for updates. Most data are useless, because we do not know how they can inform us, or we do not know if we can trust them. This is why “Data is potential information, analogous to potential energy: work is required to release it.” (p21) The processing can be a daunting task, not to mention the boring and often neglected documentation duties after the dataset is finalized and pronounced error-free by the person in charge of quality control. Data analysts are reported to spend about 80% of their working hours on data processing rather than data analysis – partly because data processing is very laborious, and partly because they do not know whether the person who sat before them at the same desk has already performed these tasks, or whether the person responsible for quality control has checked for errors.

While humans are much better at analysing information, and human agency is required for trustworthy AI, computers are much better at processing and documenting data. We apply three important concepts in our data service: we always process the data into the tidy format, we create an authoritative copy, and we always automatically add descriptive and processing metadata.

The tidy data format means that the data has a uniform and clear structure and semantics; therefore, it can be automatically validated for many common errors and automatically documented by our software or by any other professional data science application. It is not as strict as the schema of a relational database, but it is strict enough to make, among other things, importing into a database easy.

The authoritative copy is held in an independent repository and has a globally unique identifier. This protects you from accidental data loss and from mixing it up with unfinished and untested versions.

The descriptive metadata contains information on how to find the data, access it, join it with other data (interoperability), and use and reuse it, even years from now. Among other things, it contains file format information and intellectual property rights information.

The processing metadata makes the data usable in strictly regulated professional environments, such as public administration, law firms, investment consultancies, or scientific research. We give you the entire processing history of the data, which makes peer review or external audit much easier and cheaper.

Metadata is often more valuable and more costly to make than the data itself, yet it remains an elusive concept for senior or financial management. Metadata is information about how to correctly use the data, and it has no value without the data itself. Data acquisition – buying from a data vendor, paying an opinion polling company, or hiring external data consultants – appears among the material costs, but metadata is never sold alone, and you never see its cost. If the data source is cheap or of low quality, you may not get metadata at all. If you do not have it, it will show up as a human resource cost in research (when your analysts or junior researchers spend countless hours working out the missing metadata on the correct use of the data) or in sales costs (when you try to reuse a research, consulting or legal product and have to comb through your archive and retest elements again and again).

5.1 The Data Sisyphus

Sisyphus was punished by being forced to roll an immense boulder up a hill only for it to roll down every time it neared the top, repeating this action for eternity. This is the price that project managers and analysts pay for the inadequate documentation of their data assets. When was a file downloaded from the internet? What has happened to it since? Are there updates? Was the bibliographical reference prepared for citations? Were missing values imputed? Were currencies translated? Who knows about it – who created a dataset, who contributed to it? Which spreadsheet file is an intermediate format, and which is the final one, checked and approved by a senior manager? Data documentation is very boring, and like data processing, when done manually, it can be extremely time consuming. In the case of small datasets, manually creating the description of a dataset may cost more in time and wages than the data itself. But without proper documentation, the data will be very difficult to find, even on your own computer or server, and very difficult to reuse. As the years pass, the project managers will decide that it is better to download or buy it again; the analyst will rather re-check the currency translations.

5.2 Processing

Data processing means bringing the data to the correct form: translating dollars to euros at the correct exchange rate, making sure that tons and kilograms are used consistently, or that each observation is in a row and each variable is in a column. This is a very error-prone process if you do it with keystrokes, or worse still, with mouse movements; and with complex data, like surveys or harmonized surveys, you need to perform tens of thousands of such steps, which no human can do without errors – and no supervisor or auditor can meaningfully check for errors. Data processing adds a lot of value to data – often more than what you paid for access to the raw data, the uncut diamond. If you do not have a good data science team on the payroll, it will also be very costly to do in house, because spreadsheet applications like Excel or OpenOffice, or statistical applications like SPSS or Stata, are ill-equipped to perform these tasks at scale.
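
A minimal sketch of what a scripted – and therefore repeatable and reviewable – processing step can look like. The raw_gdp table, the exchange rates, and all values below are purely illustrative, not real figures:

library(dplyr)
library(tibble)

# Hypothetical raw data: mixed currencies and units (illustrative values only)
raw_gdp <- tribble(
  ~country, ~year, ~value, ~currency, ~unit,
  "NL",      2020,  800,    "EUR",     "billion",
  "US",      2020,  20.9,   "USD",     "trillion"
)

# Illustrative euro conversion rates; a real workflow documents their source
exchange_rates <- tribble(
  ~currency, ~year, ~eur_rate,
  "EUR",      2020,  1.00,
  "USD",      2020,  0.88
)

# One scripted step instead of thousands of keystrokes: convert every value
# to billions of euros, so the column is consistent across countries.
harmonised_gdp <- raw_gdp %>%
  left_join(exchange_rates, by = c("currency", "year")) %>%
  mutate(
    value    = value * eur_rate * if_else(unit == "trillion", 1000, 1),
    currency = "EUR",
    unit     = "billion"
  ) %>%
  select(country, year, value, currency, unit)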

Our datasets come with processing metadata: we say exactly what happened to your data, or to the data we took from an open or paid source, since it arrived to us. We not only record each processing step, but we also link the code of the software that touched your data asset, and we even record a description of the computational environment – compiler versions, operating system, software extensions – for full computational reproducibility. Your experts can review the process and replicate it. We also use codes and descriptions that follow the SDMX statistical standards to avoid mixing Fahrenheit with Celsius, million dollars with billion dollars, GDP per capita at PPP-adjusted or market prices, and other similar problems. This makes the data interoperable: it is easy to correctly join with other data at the level of data analysis, to import into a database by your engineers, or to find related data that you can combine with it for more informative insights.
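
The processing history itself can be stored as data. A minimal, hypothetical sketch of a processing-log record (the column names are illustrative, not our internal schema):

# Each transformation gets a row stating what happened, when, with which
# function, and in which environment, so an auditor can replay the steps.
processing_log <- data.frame(
  step        = 1:2,
  action      = c("currency conversion to EUR",
                  "unit harmonisation to billions"),
  function_id = c("left_join()/mutate()", "mutate()"),
  timestamp   = format(Sys.time(), tz = "UTC", usetz = TRUE),
  r_version   = R.version.string,
  stringsAsFactors = FALSE
)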

5.3 Documenting for future findability, interoperability and reuse

Our datasets automatically come with rich descriptive metadata, and they conform to both the Dublin Core metadata standard and the DataCite metadata standard. This is a requirement in new research funding, and it is common sense to store this with every dataset you have. It allows you to store an authoritative copy of the finalized and cross-checked dataset with a digital object identifier (DOI). The DOI uniquely identifies the data, its creators, and its contributors in the world, and makes sure that the authoritative data exists somewhere and cannot be accidentally deleted or modified. Our open data has an authoritative copy on the Zenodo open repository: nobody can claim that they created it, and we cannot accidentally lose it either. If you use our data, you can find the descriptive information both on our server and on Zenodo with the reference copy. If we work with your data, we can place the data in the repository of your choice.

5.4 Accessibility and rights

Managers often loathe the idea of using any data that comes from the internet, because of the legal risk attached to it. Is a dataset protected by some intellectual property right, like copyright, or, in the EU, a database right? Our datasets contain rights information and a link to the use license. In many cases, as is customary with open science sources, your obligation is to reference the original creator (and/or publisher, curator, manager) of the data. In other cases, you may face other restrictions, and you may need to upgrade your license for commercial use, for example.

Our observatory’s datasets and descriptive information, including visualizations, by default arrive with the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC-BY-NC-SA) license. This allows you to use the dataset freely and to modify it in your own research products, provided you share the modification under the same CC-BY-NC-SA license. If you need a more permissive license (for example, you want to give somebody exclusive access to your research product), we can create a new, similar dataset for you with more permissive rights.

The problem with Open Data is that while it is legally accessible, and often cost-free, in most cases it is not findable, and often not even directly accessible. While the EU has had policies since 2003 about making taxpayer-funded data reusable, it has taken few technical steps to make this a reality. Reusability of governmental and scientific data is a right, but not a practical possibility for most users. The data may sit in various historical file formats, without documentation.

Interoperability means that you can use governmental open data, scientific open data, your own system’s data, your membership organizations’ shared resources, and data subscriptions together. A special case of the lack of interoperability is when you do not know if you are facing a legal risk from using a data resource.

In our experience, in most research-driven organizations, such as consultancies, law firms, university research groups, NGOs and public policy bodies, reusability is mainly limited by poor data documentation, and sometimes by the use of obsolete proprietary file formats. Documentation in the short run is not always a necessity; it belongs to the non-billable hours, or among the tasks that do not really count towards a researcher’s promotion. Poor documentation, however, makes it extremely demanding to reuse a dataset or document a few years from now. From 2021, if you apply for Horizon Europe funding, your research output must meet basic findability and reusability criteria.

  • Our data comes with metadata that meets the requirements of two metadata standards: the more general Dublin Core, and the mandatory and recommended properties of the more specific DataCite standard for datasets. We go even further: we add rich processing metadata beyond these requirements. These are not only nice to have: from 2021, if you apply for Horizon Europe funding, they are required. Our solution can automate this process, and besides making you compliant, it adds a lot of value to your own data management.

  • We make our data-as-a-service APIs FAIR by automatically adding standardized metadata to our data that fulfils the mandatory elements of the Dublin Core metadata standard and, at the same time, the mandatory and most of the recommended properties of DataCite.

  • Furthermore, we apply the tidy data concept to our partners’ data assets. The tidy data principle applies a certain structure to datasets that facilitates immediate use and reuse.

5.5 FAIR

In 2016, the FAIR Guiding Principles for scientific data management and stewardship were published in Scientific Data. The authors intended to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention), because humans increasingly rely on computational support to deal with the growing volume, complexity, and creation speed of data.

Practical “how to” guidance for going FAIR can be found in the Three-point FAIRification Framework.

Findable

The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.

F1. (Meta)data are assigned a globally unique and persistent identifier

F2. Data are described with rich metadata (defined by R1 below)

F3. Metadata clearly and explicitly include the identifier of the data they describe

F4. (Meta)data are registered or indexed in a searchable resource

Accessible

Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

A1.1 The protocol is open, free, and universally implementable

A1.2 The protocol allows for an authentication and authorisation procedure, where necessary

A2. Metadata are accessible, even when the data are no longer available

Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (Meta)data use vocabularies that follow FAIR principles

I3. (Meta)data include qualified references to other (meta)data

Reusable

The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (Meta)data are released with a clear and accessible data usage license

R1.2. (Meta)data are associated with detailed provenance

R1.3. (Meta)data meet domain-relevant community standards

The principles refer to three types of entities: data (or any digital object), metadata (information about that digital object), and infrastructure. For instance, principle F4 defines that both metadata and data are registered or indexed in a searchable resource (the infrastructure component).

5.6 The Dublin Core

See: Dublin Core Cross-Domain Attribute Set

Contributor An entity responsible for making contributions to the resource.
Coverage The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
Creator An entity primarily responsible for making the resource.
Date A point or period of time associated with an event in the lifecycle of the resource.
Description An account of the resource.
Format The file format, physical medium, or dimensions of the resource.
Identifier An unambiguous reference to the resource within a given context.
Language A language of the resource.
Publisher An entity responsible for making the resource available.
Relation A related resource.
Rights Information about rights held in and over the resource.
Source A related resource from which the described resource is derived.
Subject The topic of the resource.
Title A name given to the resource.
Type The nature or genre of the resource.

We record all metadata that is described in the Dublin Core, but we use the DataCite property names. We will provide a simple converter to convert the metadata between Dublin Core and DataCite.
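
The converter is not yet released; as a simple illustration (not the final implementation), the core of such a conversion can be a lookup table between the two vocabularies:

# Illustrative lookup from Dublin Core terms to DataCite property names.
# Some mappings are approximations (e.g. Coverage has no single DataCite twin).
dublin_to_datacite <- c(
  Contributor = "Contributor",  Coverage  = "GeoLocation",
  Creator     = "Creator",      Date      = "Date",
  Description = "Description",  Format    = "Format",
  Identifier  = "Identifier",   Language  = "Language",
  Publisher   = "Publisher",    Relation  = "RelatedIdentifier",
  Rights      = "Rights",       Source    = "RelatedItem",
  Subject     = "Subject",      Title     = "Title",
  Type        = "ResourceType"
)

# Usage: rename a Dublin Core record into DataCite property names
dc_record <- list(Title   = "Population of Small European Countries",
                  Creator = "Joe, Doe",
                  Rights  = "CC-BY-NC-SA-4.0")
names(dc_record) <- dublin_to_datacite[names(dc_record)]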

5.7 DataCite

We use all mandatory DataCite metadata fields, and many of the recommended and optional ones.

ID Property Structure Obligation
1 Identifier With mandatory type sub-property M
2 Creator With optional name identifier and affiliation sub-properties M
3 Title With optional type sub-properties M
4 Publisher M
5 PublicationYear M
10 ResourceType With mandatory general type description sub-property M
6 Subject With scheme sub-property R
7 Contributor With type, name identifier, and affiliation sub-properties R
8 Date With type sub-property R
9 Language O
11 AlternateIdentifier With type sub-property O
12 RelatedIdentifier With type and relation type sub-properties R
13 Size O
14 Format O
15 Version O
16 Rights O
17 Description With type sub-property R
18 GeoLocation With point, box and polygon sub-properties R
19 FundingReference With name, identifier, and award related sub-properties O
20 RelatedItem With identifier, creator, title, publication year, volume, issue, number, page, publisher, edition, and contributor sub-properties O

See for further reference DataCite Descriptive Metadata.

5.7.1 Mandatory

The Mandatory (M) properties identify the dataset and make it citable; a minimal example record follows the list.

Identifier An unambiguous reference to the resource within a given context. (Dublin Core item, but several identifiers are allowed, and we use several of them.)
Creator The main researchers involved in producing the data, or the authors of the publication, in priority order. To supply multiple creators, repeat this property. (Extends the Dublin Core with multiple authors and legal persons, and adds affiliation data.)
Title A name given to the resource. (Extends Dublin Core with alternative title, subtitle, translated title, and other title(s).)
Publisher The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role. For software, use Publisher for the code repository. (Dublin Core item.)
Publication Year The year when the data was or will be made publicly available.
Resource Type We publish Datasets, Images, Reports, and Data Papers. (Dublin Core item with controlled vocabulary.)
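
A minimal sketch of a record that fills only these mandatory properties, using the values of the example dataset in section 5.9 (the list structure is illustrative, not the dataobservatory API):

# Illustrative record of the mandatory DataCite properties.
datacite_mandatory <- list(
  Identifier      = list(identifier     = "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
                         identifierType = "DOI"),
  Creator         = list(name = "Joe, Doe"),
  Title           = "Population of Small European Countries",
  Publisher       = "Reprex",
  PublicationYear = 2021,
  ResourceType    = list(resourceTypeGeneral = "Dataset")
)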

5.7.2 Optional

The Optional (O) properties provide a richer description of the resource. They are less important for findability, but essential for building a web service. In the mandatory and recommended fields we follow other metadata standards and code lists, but in the optional fields we have to build up our own system for the observatories.

Language A language of the resource. (Dublin Core item.)
Alternative Identifier An identifier or identifiers other than the primary Identifier applied to the resource being registered.
Size We give the size of the downloadable CSV dataset in bytes.
Format We give file format information. We mainly use CSV and JSON, and occasionally rds and SPSS types. (Dublin Core item.)
Version The version number of the resource.
Rights We give rights descriptions that follow the SPDX License List standard, with URLs to the actual license. (Dublin Core item: Rights Management)
Funding Reference We provide the funding reference information when applicable. This is usually mandatory with public funds.
Related Item We give information about our observatory partners’ related research products, awards, and grants (also a Dublin Core item, as Relation). We particularly include source information when the dataset is derived from another resource (Source is a Dublin Core item).
  • In the Language we only use English (eng) at the moment.
  • By default, we do not use the Alternative Identifier property. We will do so when the same dataset is used in several observatories.
  • The Size property is measured in bytes for the CSV representation of the dataset. During creation, the software creates a temporary CSV file to check that the dataset has no writing problems, and measures the dataset size.
  • The Version property needs further work. For a daily refreshing API we need to find an applicable versioning system.
  • The Funding reference will contain information for donors, sponsors, and co-financing partners.
  • Our default setting for Rights is the CC-BY-NC-SA-4.0 license, and we provide a URI for the license document.
  • In the RelatedItem we give information about:
    • The original (raw) data source.
    • Methodological bibliography reference, when needed.
    • The open-source statistical software code that processed the data.

5.8 Administrative Metadata

Like with diamonds, it is better to know the history of a dataset, too. Our administrative metadata contains code lists that follow the SDMX statistical metadata standards, and similarly structured information about the processing history of the dataset.

See for further reference The codebook Class.

Observation Status SDMX Code list for Observation Status 2.2 (CL_OBS_STATUS), such as actual, missing, imputed, etc. values.
Method If the value is estimated, we provide modelling information.
Unit We provide the measurement unit of the data (when applicable.)
Frequency SDMX Code list for Frequency 2.1 (CL_FREQ) frequency values
Codelist Euros-SDMX Codelist entries for the observational units, such as sex, etc.
Imputation SDMX Code list for Imputation Methods (CL_IMPUT_METH) imputation values
Estimation The estimation methodology of data that we calculated, together with citation information and URI to the actual processing code
Related Item We give information about the software code that processed the data (both Dublin Core and DataCite compliant.)

See an example in The codebook Class article of the dataobservatory R package.
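
As a generic illustration – not the codebook class itself – administrative metadata can travel with the observations as additional, standard-coded columns:

# Illustrative observations with SDMX-coded administrative metadata columns:
# obs_status uses CL_OBS_STATUS codes ("A" = normal value, "O" = missing value),
# freq uses CL_FREQ ("A" = annual). Values are illustrative, not real statistics.
small_population <- data.frame(
  geo        = c("AD", "LI", "SM"),
  time       = rep(2020, 3),
  value      = c(77000, 39000, NA),
  freq       = "A",
  obs_status = c("A", "A", "O"),
  stringsAsFactors = FALSE
)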

5.9 An Example

## DataCite information for Population of Small European Countries 
## # A tibble: 21 x 2
##    Property        Value                                                        
##    <chr>           <chr>                                                        
##  1 dataset_code    "small_population_total"                                     
##  2 Identifier      "small_population_total"                                     
##  3 Creator         "Joe, Doe"                                                   
##  4 Title           "Population of Small European Countries"                     
##  5 Publisher       "Reprex"                                                     
##  6 PublicationYear "2021"                                                       
##  7 ResourceType    "Dataset"                                                    
##  8 Subject         "Demography"                                                 
##  9 Contributor      <NA>                                                        
## 10 Date            "{\"Updated\":[\"2021-09-20\"],\"EarliestObservation\":[\"20~
## # ... with 11 more rows

The Description property has three mandatory elements:

  • The Abstract is a short, textual description.
  • In the TechnicalInfo sub-field, we automatically record utils::sessionInfo() for computational reproducibility.
  • In the Other sub-field, we record the keywords for structuring the observatory.
## $Abstract
## [1] "Example dataset with three small countries"
## 
## $TechnicalInfo
## [1] "{\"platform\":[\"x86_64-w64-mingw32/x64 (64-bit)\"],\"locale\":[\"LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252\"],\"running\":[\"Windows 10 x64 (build 19042)\"],\"RNGkind\":[\"Mersenne-Twister\",\"Inversion\",\"Rejection\"],\"basePkgs\":[\"stats\",\"graphics\",\"grDevices\",\"utils\",\"datasets\",\"methods\",\"base\"],\"matprod\":[\"default\"],\"BLAS\":[\"\"],\"LAPACK\":[\"\"],\"system.codepage\":[\"1250\"],\"codepage\":[\"1252\"]}"
## 
## $Other
## [1] "{\"id\":[\"keyword1\",\"keyword2\",\"keyword3\",\"keyword4\"],\"name\":[\"greendeal\",\"Demography\",\"Population\",\"Testing\"]}"

The Geolocation Property:

## $geoLocationPlace
## [1] NA
## 
## $geoCodes
## [1] "LI|AD|SM"

5.10 Tidy Data

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

It is often said that 80% of data analysis is spent on cleaning and preparing the data. And it is not just a first step: it must be repeated many times over the course of the analysis as new problems come to light or new data are collected. The tidy data principle applies a certain structure to datasets that facilitates immediate use and reuse.

The principles of tidy data provide a standard way to organise data values within a dataset. A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

In tidy data, we need a way to describe the underlying semantics, or meaning, of the values displayed in the table (a small example follows the list):

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.
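
For example, a tidy version of a small population table – with purely illustrative values – looks like this:

# Tidy: one variable per column (geo, year, population),
# one observation per row, one value per cell.
tidy_population <- data.frame(
  geo        = c("AD", "AD", "LI", "LI"),
  year       = c(2019, 2020, 2019, 2020),
  population = c(76000, 77000, 38000, 39000)  # illustrative values
)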

5.11 Messy data

Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable – this is particularly true of open data, even when it is released by statistical bodies. The most typical errors:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column, for example, the number of items purchased and the price of the item.
  • Variables are stored in both rows and columns, for example, the columns are organized by income level groups.
  • Multiple types of observational units are stored in the same table. For example, names of musicians and their songs.
  • A single observational unit is stored in multiple tables.

While messy data can almost always be tidied, if you do this manually in a spreadsheet application, it is almost impossible not to make a mistake. Spreadsheet applications do not check the tidiness of the data, and they do not record logs of your manual tidying efforts. Unless every single mouse click and drag and drop is recorded, the work is impossible to validate. However, we believe that data should never be manipulated with a mouse. Computer algorithms should be deployed in a way that their tidying efforts are reproducible and logged.
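
A sketch of such a scripted, reproducible tidying step with the tidyr package: the messy table below stores years in the column headers (the first problem on the list above), and a single pivot_longer() call reshapes it into the tidy form shown earlier – a step that can be logged and re-run, unlike a series of mouse clicks. The values are illustrative:

library(tidyr)

# Messy: the column headers "2019" and "2020" are values of the year variable.
messy_population <- data.frame(
  geo    = c("AD", "LI"),
  `2019` = c(76000, 38000),
  `2020` = c(77000, 39000),
  check.names = FALSE
)

# One reproducible, reviewable step instead of manual copy-and-paste.
tidy_population <- pivot_longer(
  messy_population,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "population"
)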