Metadata and documentation

If you have data, you inevitably have metadata as well. Metadata and documentation provide information about your data, such as the topic of your project or the circumstances under which your data were obtained. Without metadata and documentation, many data are just numbers without meaning. This “data about data” is therefore essential to understand, find, reuse and manage (the context of) your data.

The distinction between metadata and documentation is not always straightforward. In this guide, we consider metadata as structured, machine-readable and interoperable information about your data, and documentation as mostly human-readable information about a project and its data. Both are important to provide context about your data!

Project-level documentation

Project-level documentation ensures that your future self and others can find the data and understand their context. Who contributed to the project, when did the project take place, how were the data collected and analyzed, who can (re)use the data?

There is a lot of information that you can provide at the project level, and you can do so in many ways, such as in a README file (e.g., a text or markdown file), in separate documents in the dataset, or in the form of links to externally published information (e.g., manuscript, study preregistration, etc.). If you deposit your (meta)data in a data repository, some basic project-level information is also added by the repository in the form of structured metadata.

Project-level documentation can consist of:

Basic information about the project

Title, project ID, author(s) and contributor(s), institution involved, funder, grant number, references to related projects and publications, when and where the data were collected, contact person for the dataset, etc. This information is often included in a README file in the root of a project folder or in the structured metadata in a data repository.

Content of the project or dataset

Keywords, subject area and abstract are often included in the structured metadata in a data repository. Additionally, the README file in the root of a project folder will often contain an explanation of the files and folders in the dataset, the data types, size, file formats, file versions, etc.

Methodological information

How were the data created and analyzed? For example, instruments and instrument settings, experimental protocol, target population and sampling methods, data cleaning and analysis workflow, scripts and tools used for data capture and analysis, pseudonymisation methods, quality assurance methods, etc. This information is usually included in a manuscript, study preregistration, study protocol, scripts, experiment files, etc., but could also be (partly) described in a README file.

Data access and reuse

Preservation period, identifier of the dataset(s) (if published), data access conditions and responsible parties, data license, citation information, etc. This information is often included in the structured metadata of a data repository, in a README file or in a separate access protocol document.

Administrative documents

Data Management Plan, funding proposals, ethical applications, Data Protection Impact Assessment, agreements, information letter to participants, etc. These documents are often a part of your dataset (but please separate administrative data, such as participants' contact information, from the research data).

Data-level documentation

Data-level documentation can be used to explain how to interpret your data. Whereas most documentation has to be created manually, file-level technical metadata (e.g., date of creation, author, etc.) can sometimes be extracted with tools or scripts, e.g., the Extract Metadata tool.
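
As a minimal illustration of such a script, the sketch below uses Python’s standard library to collect some basic technical metadata (file size and modification date) for the files in a data folder; the folder name is a placeholder and the sketch is not tied to any specific tool.

```python
from datetime import datetime, timezone
from pathlib import Path

# Placeholder folder name; point this to your own data folder.
data_dir = Path("data/raw")

for path in sorted(data_dir.rglob("*")):
    if path.is_file():
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        print(f"{path.name}\t{info.st_size} bytes\tlast modified {modified:%Y-%m-%d}")
```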

Some examples of data-level documentation are:

A codebook or data dictionary

Not every user of your data (you included) will immediately understand that the variable “length” measures the height of a person in cm. To help them, a data dictionary or codebook is often used to explain the variable names, labels (e.g., F = female, M = male), measurement units, response options, descriptions and sometimes even data summaries (e.g., frequencies, number of missing values, means). A codebook can simply be a PDF document, but we recommend creating a machine-readable codebook (e.g., in .csv, .json, or .xml format), for example using the “Cookbook for a codebook” template. There are also programming packages available to automatically create a codebook from a dataset (e.g., the codebook R package).
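
As an illustration of what a machine-readable codebook can look like, the sketch below writes a minimal codebook to a .csv file using Python’s standard library; the variables, descriptions and units are invented for the example and are not prescribed by any particular template.

```python
import csv

# Invented example entries: one row per variable in the dataset.
codebook = [
    {"variable": "length", "description": "Height of the participant", "unit": "cm",
     "type": "numeric", "allowed_values": ""},
    {"variable": "sex", "description": "Sex of the participant", "unit": "",
     "type": "categorical", "allowed_values": "F = female; M = male"},
    {"variable": "rt", "description": "Mean reaction time", "unit": "ms",
     "type": "numeric", "allowed_values": ""},
]

with open("codebook.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(codebook[0].keys()))
    writer.writeheader()
    writer.writerows(codebook)
```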

A laboratory notebook

Lab notebooks often contain unstructured information about the circumstances under which data were collected. This kind of information may be important for the way the data are analyzed or processed further. Utrecht University offers the eLabJournal tool for keeping a lab notebook; more information is available on the UU intranet.

File overview sheet

If your dataset consists of several data files (e.g., several interview scripts) and it is not immediately apparent what is in them, you can create a file overview sheet. The content of such an overview file will depend on your dataset. Below is an example, followed by a sketch of how part of such a sheet could be generated with a script:

| id | researcher | date       | subject-age | script                 | datafile              |
|----|------------|------------|-------------|------------------------|-----------------------|
| 1  | Alice      | 2023-01-10 | 25          | FLProj_subj1_script.R  | FLProj_subj1_raw.txt  |
| 2  | William    | 2023-01-15 | 30          | FLProj_subj2_script.R  | FLProj_subj2_raw.txt  |
| 3  | Rashida    | 2023-02-15 | 23          | FLProj_subj3_script.R  | FLProj_subj3_raw.txt  |
| 4  | Alice      | 2023-03-26 | 23          | FLProj_subj4_script.R  | FLProj_subj4_raw.txt  |
| 5  | Rashida    | 2023-03-26 | 31          | FLProj_subj5_script.R  | FLProj_subj5_raw.txt  |
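
If your files follow a consistent naming pattern, part of such an overview sheet can be generated automatically. The sketch below is a minimal Python example that assumes the (fictional) FLProj naming pattern from the table above; columns such as researcher and subject age would still be filled in by hand.

```python
import csv
import re
from pathlib import Path

# Placeholder folder containing the FLProj script and raw data files.
data_dir = Path("data")

rows = []
for script in sorted(data_dir.glob("FLProj_subj*_script.R")):
    match = re.search(r"subj(\d+)", script.name)
    if match is None:
        continue  # skip files that do not follow the expected pattern
    subj_id = match.group(1)
    raw_file = data_dir / f"FLProj_subj{subj_id}_raw.txt"
    rows.append({
        "id": subj_id,
        "script": script.name,
        "datafile": raw_file.name if raw_file.exists() else "",
    })

with open("file_overview.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "script", "datafile"])
    writer.writeheader()
    writer.writerows(rows)
```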

Headers or summaries at the start of a text file

For textual qualitative data, relevant metadata can be added to the header of a file or as a summary at the start of a data file. This can for example be useful for interpreting interview data and may concern participants' age, occupation, date, location of the measurement, etc. Alternatively, an overview sheet may also be useful for this type of data.

Metadata standards and vocabularies

Whereas documentation is human-readable, metadata schemas are also machine-readable and can be standardized. Metadata standards make sure that it is clear exactly what is meant by each field in the metadata, and they enable findability and interoperability of your data. For example, the field “Creator” in the Dublin Core standard can be interpreted by every machine as the creator of the object it describes. This allows systematic searching and filtering on objects that have the same value (name) in the Creator field.

There are several general and field-specific metadata standards, in which commonly used metadata fields are defined, such as Dublin Core (generic) or the Data Documentation Initiative (social sciences). You can find other standards in the Metadata Standards Catalog. Metadata standards can be used to describe datasets in data repositories, or publications in a library system. For example, most libraries use the generic Dublin Core metadata standard to describe publications, whereas data repositories such as DataverseNL or Yoda use the DataCite standard.
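
As a small illustration of what such standardized metadata can look like, the sketch below builds a few Dublin Core fields as XML using Python’s standard library; the dataset details are fictional.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A minimal record with a few Dublin Core fields; the values are fictional.
record = ET.Element("record")
for field, value in [
    ("creator", "Jansen, A."),
    ("title", "Example survey dataset on commuting behaviour"),
    ("date", "2023-06-01"),
    ("subject", "commuting"),
]:
    element = ET.SubElement(record, f"{{{DC}}}{field}")
    element.text = value

print(ET.tostring(record, encoding="unicode"))
```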

Machine-readable metadata can either be embedded in the files themselves or stored as a separate file alongside the data (e.g., JSON, XML). Some image files can for example contain EXIF metadata, whereas Yoda datasets have an accompanying JSON file that contains the project-level metadata.
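
The sketch below illustrates both options: reading embedded EXIF metadata from an image (assuming the third-party Pillow library is installed) and writing a separate JSON sidecar file. The file names and sidecar fields are placeholders.

```python
import json

from PIL import ExifTags, Image  # Pillow, a third-party imaging library

# Embedded metadata: read the EXIF tags stored inside an image file.
with Image.open("photo_001.jpg") as img:  # placeholder file name
    for tag_id, value in img.getexif().items():
        print(ExifTags.TAGS.get(tag_id, tag_id), value)

# Sidecar metadata: a separate JSON file describing a data file.
sidecar = {
    "filename": "interview_01.txt",
    "creator": "Jansen, A.",
    "date_collected": "2023-01-10",
    "description": "Transcript of interview 1 (pseudonymised)",
}
with open("interview_01.json", "w", encoding="utf-8") as f:
    json.dump(sidecar, f, indent=2)
```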

Controlled vocabularies

You may use concepts in your project that have a specific meaning and that require a definition to avoid confusion. For example, “sugar collected”, “rewards”, and “pellets” could all mean the same thing, and using one term with a standard definition (e.g., “reward”) may help avoid confusion as to what is meant and which word to use.

Such concepts can be defined in ontologies and controlled vocabularies. Ontologies are vocabularies that you can use to describe constructs and their relationships (e.g., a four-wheeled vehicle is called a “car”, and a wheel is one of its components). Controlled vocabularies are non-hierarchical lists of standard terms and their definitions.

A controlled vocabulary is something you can publish. That way, each controlled term gets its own persistent identifier (usually a Uniform Resource Identifier, URI), which uniquely identifies the term and its definition and which can also be used by others. If you use those predefined terms and their identifiers in your data and metadata, both humans and machines know what you mean by a term, and datasets that use the same terms can easily be linked together. For example, you can use controlled terms and their identifiers as keywords in the metadata of DataverseNL datasets.
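
In practice, a keyword taken from a controlled vocabulary can be stored as a human-readable label together with the term’s URI, as in the minimal sketch below; the term and identifier shown are placeholders rather than entries from an actual vocabulary.

```python
import json

# A keyword recorded as a controlled-vocabulary term: the label plus the
# persistent identifier (URI) that defines the term.
keyword = {
    "label": "reward",
    "uri": "https://example.org/vocab/term/0123",  # placeholder identifier
}
print(json.dumps(keyword, indent=2))
```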

There is no single vocabulary that covers everything; instead, there are many discipline-specific vocabularies that you can use to describe your data.

Support with metadata and documentation

If you have any questions or need help providing metadata and documentation in your project, please feel free to contact us. We can for example help you make your dataset easier for others to understand, or help you choose or create a metadata scheme relevant to your research project.