Documenting data and derived datasets
Accurate documentation is crucial for understanding and interpreting your data, both during the research process and for future use or sharing with others. Clear documentation supports transparency and reproducibility, and helps others understand and evaluate your findings.
Documentation that accompanies a particular dataset or study contains the metadata, or data about the data.
There are different levels and types of data documentation to be aware of when using or creating documentation.
Data documentation levels
- Study-level documentation focuses on the study as a whole and includes information about the research design, methodology, sampling, data collection methods, and questionnaire(s).
- For longitudinal population studies, the study-level documentation focuses on the overall aspects of the study, such as the study timeline, frequency of data collection, and structure of the longitudinal dataset. It could also include information about unique identifiers assigned to individuals across waves, data harmonisation procedures and longitudinal missing data strategies.
Specifically for longitudinal population studies, sweep/wave-level documentation will contain similar information to study-level documentation, but specific to a certain wave of data collection.
Variable-level documentation is commonly in the form of a codebook and contains information about variable names, question texts, labels and descriptions, coding frames, and missing data codes.
Most statistical programs can produce a codebook for a given dataset.
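To illustrate what a codebook entry typically records, here is a minimal sketch in Python using only the standard library. It is not the output of any particular statistical package. The sample data, the variable and value labels, and the missing-data code `-9` are all hypothetical and would normally come from the study's questionnaire and coding frames.

```python
import csv
import io
from collections import Counter

# Hypothetical survey extract; -9 is ASSUMED to be the missing-data code.
RAW = """id,sex,age
1,1,34
2,2,-9
3,2,51
4,1,47
"""

# Hypothetical labels, normally taken from the questionnaire and coding frames.
VARIABLE_LABELS = {
    "id": "Participant identifier",
    "sex": "Sex of respondent",
    "age": "Age in years",
}
VALUE_LABELS = {"sex": {"1": "Male", "2": "Female"}}
MISSING_CODES = {"-9"}


def build_codebook(text):
    """Return one codebook entry (a dict) per variable in the CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    entries = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        valid = [v for v in values if v not in MISSING_CODES]
        entries.append({
            "variable": name,
            "label": VARIABLE_LABELS.get(name, ""),
            "n_valid": len(valid),
            "n_missing": len(values) - len(valid),
            "n_unique": len(set(valid)),
            "value_labels": VALUE_LABELS.get(name, {}),
            # Frequencies are only reported for labelled (categorical) variables.
            "counts": dict(Counter(valid)) if name in VALUE_LABELS else {},
        })
    return entries


codebook = build_codebook(RAW)
for entry in codebook:
    print(entry)
```

Each entry pairs the variable name with its label, coding frame and missing-data count, which is the core of what a reader needs to interpret a column correctly.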
File-level documentation focuses on the individual files or datasets within a study. It includes details about the file format, data structure, record layout, missing values, and any transformations applied.
It may also specify relationships between files or provide information about data subsets.
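Some of these file-level details can be gathered automatically. The sketch below is one possible approach, assuming a comma-delimited UTF-8 file with a header row. The function name `describe_file`, the file names and the transformation note are hypothetical, and the checksum is an addition that is handy for confirming a file has not changed.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def describe_file(path, transformations=(), related_files=()):
    """Collect basic file-level metadata for a CSV file with a header row."""
    path = Path(path)
    data = path.read_bytes()
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)               # record layout: column order
        n_records = sum(1 for _ in reader)  # data rows, excluding the header
    return {
        "file_name": path.name,
        "format": "CSV (comma-delimited, UTF-8)",  # ASSUMED, not detected
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "record_layout": header,
        "n_records": n_records,
        "transformations": list(transformations),
        "related_files": list(related_files),
    }


# Demonstration with a small temporary file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "wave1_main.csv"
    example.write_text("id,sex,age\n1,1,34\n2,2,-9\n", encoding="utf-8")
    file_doc = describe_file(
        example,
        transformations=["age recoded to -9 where not reported"],
        related_files=["wave1_household.csv (linked by id)"],
    )

print(json.dumps(file_doc, indent=2))
```

Transformations and relationships between files cannot be inferred from the file itself, so they are passed in by hand here and still need to be written by someone who knows the data.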
If data from other sources have been linked to a dataset, the linkage documentation describes the linkage procedures, including linkage variables, matching algorithms, and privacy protection measures.
If you are creating documentation for any data, it is particularly important to familiarise yourself with metadata standards, which provide guidelines for describing and documenting research data.
Following these standards ensures consistency and compatibility with other datasets, which enables future data integration and comparison.
Commonly used metadata standards include, but are not limited to:
The Data Documentation Initiative (DDI) has two main standards that are applicable to different types of data:
- DDI Codebook – to document simple or single survey data
- DDI Lifecycle – to document the lifecycle of a longitudinal dataset or multiple datasets.
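To give a feel for what DDI Codebook metadata looks like, the following sketch builds a small XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`codeBook`, `stdyDscr`, `dataDscr`, `var`, `labl`, `catgry`, `catValu`) and the namespace follow DDI Codebook 2.5 as best understood here. The fragment is deliberately incomplete and has not been validated against the official schema, so check it against the DDI specification before using it. The study title and variables are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace ASSUMED from DDI Codebook 2.5; verify against the official schema.
DDI_NS = "ddi:codebook:2_5"
ET.register_namespace("", DDI_NS)


def q(tag):
    """Qualify a tag name with the DDI Codebook namespace."""
    return f"{{{DDI_NS}}}{tag}"


def build_ddi_codebook(title, variables):
    """Build a minimal, unvalidated DDI-Codebook-style element tree."""
    root = ET.Element(q("codeBook"))

    # Study description: only the title is recorded in this sketch.
    study = ET.SubElement(root, q("stdyDscr"))
    citation = ET.SubElement(study, q("citation"))
    title_stmt = ET.SubElement(citation, q("titlStmt"))
    ET.SubElement(title_stmt, q("titl")).text = title

    # Data description: one <var> per variable, with optional categories.
    data = ET.SubElement(root, q("dataDscr"))
    for var in variables:
        var_el = ET.SubElement(data, q("var"), {"name": var["name"]})
        ET.SubElement(var_el, q("labl")).text = var["label"]
        for value, label in var.get("categories", {}).items():
            cat = ET.SubElement(var_el, q("catgry"))
            ET.SubElement(cat, q("catValu")).text = value
            ET.SubElement(cat, q("labl")).text = label
    return root


# Hypothetical study and variables, for illustration only.
root = build_ddi_codebook(
    "Example Cohort Study, Wave 1",
    [
        {"name": "sex", "label": "Sex of respondent",
         "categories": {"1": "Male", "2": "Female"}},
        {"name": "age", "label": "Age in years"},
    ],
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In practice, dedicated DDI tooling or a data archive's templates are a safer route than hand-built XML, but the sketch shows how study-level and variable-level documentation map onto a single structured record.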