Harmonisation methods
Several methods can be used to harmonise data between studies. This section briefly introduces some of the methods for making data more comparable.
The methods highlighted in this section mainly refer to retrospective harmonisation, where existing data is processed to be more comparable.
This approach involves finding similar variables in the different datasets and cleaning them in a consistent way so that they can be compared. This method might also involve reducing the sample in the studies to those individuals who would be more comparable on the variables of interest.
For example, the paper below, which uses data from the CLOSER harmonised dataset on body composition, excluded from its analytical sample groups who were likely to have particularly different body mass index (BMI) values. This made it possible to run comparative analyses and to see trends in body mass over time.
PAPER: How Has the Age-Related Process of Overweight or Obesity Development Changed over Time? Co-ordinated Analyses of Individual Participant Data from Five United Kingdom Birth Cohorts by Johnson, Li, Kuh & Hardy (2015)
This approach involves putting the variables of interest on the same scale or using the same categories. This means that similar variables that were measured slightly differently are now comparable. A limitation of this method is that it can lead to less informative scales being used, as the scale or categories that are chosen are normally the lowest common denominator and so might not have the granularity of the more informative scales (Bann et al. 2022).
An example of equivalent categorisation for marital status is displayed below. Note how the final categorisation has to combine multiple categories from Study 1 to match Study 2 and how this affects the granularity of the data.
[INSERT TABLE]
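Equivalent categorisation can be sketched as a recoding step. The example below is a minimal illustration only: the category labels and mappings are hypothetical and are not taken from the table above or from any CLOSER dataset.

```python
# Hypothetical category labels, used for illustration only.
# Study 1 records five categories; Study 2 records only two.
STUDY1_TO_HARMONISED = {
    "single, never married": "not married",
    "married": "married",
    "separated": "not married",
    "divorced": "not married",
    "widowed": "not married",
}
STUDY2_TO_HARMONISED = {
    "married": "married",
    "not married": "not married",
}


def harmonise(values, mapping):
    """Recode raw values; anything unmapped becomes None so it stays visible."""
    return [mapping.get(v.strip().lower()) for v in values]


study1 = ["Married", "Divorced", "Widowed", "Single, never married"]
study2 = ["Not married", "Married"]

harmonised1 = harmonise(study1, STUDY1_TO_HARMONISED)
harmonised2 = harmonise(study2, STUDY2_TO_HARMONISED)
print(harmonised1)
print(harmonised2)
```

Returning None for unmapped values, rather than guessing a category, makes it easier to spot responses that the harmonised scheme does not yet cover. Note that the distinction between separated, divorced and widowed is lost after recoding.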
Coding systems
For equivalent categorisation, a specific coding system might be used which sets out rules for what each category includes.
For example, the researchers generating the CLOSER harmonised dataset of socioeconomic measures used one coding system for occupational class that could be applied to all the included studies. This was difficult as the studies spanned 55 years, during which the official coding systems for occupational classification changed every decade. The coding system that was chosen sat chronologically in the middle of the time period and allowed the measures from all the included studies (with some minor exceptions) to be converted to this scale.
Read more about the harmonisation process for the CLOSER socioeconomic measures dataset in the dataset user guide https://doc.ukdataservice.ac.uk/doc/8307/mrdoc/pdf/closer_wp2ses_userguide_202203revision.pdf
Meaningful cut-offs
Equivalent categories could be generated by using accepted cut-offs that have been clinically defined to indicate an increased risk of a disease or condition, or otherwise verified as a meaningful cut-off. The value of these cut-offs might change over time or between countries, so the same cut-offs should be used in all the included studies.
For example, the cut-offs for waist circumference that indicate an increased risk of heart and circulatory diseases are different for different ethnic groups. For people of white European, black African, Middle Eastern, and mixed origin, a waist circumference of more than 102cm for men and 88cm for women indicates very high risk. However, for people of African Caribbean, South Asian, Chinese and Japanese origin, a circumference of more than 90cm for men and 80cm for women indicates very high risk (British Heart Foundation).
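As a rough sketch, a single set of cut-offs can be written down once and applied to every study. The thresholds below are the figures quoted above; the group keys "A" and "B", the function name and the sex labels are hypothetical choices made for this illustration.

```python
# Very-high-risk waist thresholds in centimetres, as quoted in the text above.
# The keys "A" and "B" are hypothetical labels for the two sets of groups.
VERY_HIGH_RISK_CM = {
    # white European, black African, Middle Eastern and mixed origin
    "A": {"male": 102, "female": 88},
    # African Caribbean, South Asian, Chinese and Japanese origin
    "B": {"male": 90, "female": 80},
}


def very_high_risk(waist_cm, sex, group):
    """Return True when the waist measurement is above the relevant cut-off."""
    return waist_cm > VERY_HIGH_RISK_CM[group][sex]


# The same measurement falls on different sides of the cut-off
# depending on which threshold applies.
print(very_high_risk(95, "male", "A"))
print(very_high_risk(95, "male", "B"))
```

Keeping the thresholds in one lookup table means every study is categorised by the same rules, and any later change to a cut-off is made in one place.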
Standardisation
Continuous variables can also be transformed so they are on an equivalent scale through methods like standardisation or normalising to the mean. A common standardisation method is z-score transformation, where the individual scores are expressed in terms of standard deviations from the mean. A z-score is the original score, minus the mean, divided by the standard deviation.
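The z-score calculation can be sketched in a few lines. The scores below are made up for illustration, and the choice to standardise within each study using the population standard deviation is an assumption of this sketch rather than a rule from any particular dataset.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Express each score as standard deviations from the sample mean."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD; statistics.stdev gives the sample SD
    if sd == 0:
        raise ValueError("all scores are identical, so z-scores are undefined")
    return [(x - m) / sd for x in scores]


# Hypothetical scores: the same response pattern on two different scales.
study_a = [2, 4, 4, 4, 5, 5, 7, 9]          # measured on a 0-10 scale
study_b = [20, 40, 40, 40, 50, 50, 70, 90]  # measured on a 0-100 scale

z_a = z_scores(study_a)
z_b = z_scores(study_b)
print(z_a)
print(z_b)
```

After the transformation both lists are in the same units (standard deviations from each study's own mean), so the two studies can be compared even though the original scales had different ranges.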
For example, the CLOSER harmonised dataset on childhood environment and adult wellbeing used z-score transformation to standardise the variables for parental care.
[insert table]
Read more about the harmonisation process for the CLOSER harmonised dataset on childhood environment and adult wellbeing in the dataset user guide https://doc.ukdataservice.ac.uk/doc/8552/mrdoc/pdf/closer_wp9_userguide_202203revision.pdf
Normalising to the mean is a similar method where the mean is subtracted from each score, so the normalised scores are expressed in terms of units away from the mean.
Other methods for equivalent scaling or categorisation include:
- Using ridit scores
- Rescaling variables to sit within a given minimum and maximum value
- Using standard equivalence scales, such as the OECD (Organisation for Economic Co-operation and Development)-modified equivalence scale for household income
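The three approaches listed above can each be sketched briefly. The function names and example data are hypothetical. The ridit scores here use the sample itself as the reference distribution, which is a simplifying assumption, and the household weights are those of the OECD-modified scale (1.0 for the first adult, 0.5 for each additional person aged 14 or over, 0.3 for each child under 14).

```python
from collections import Counter


def ridit_scores(values, ordered_categories):
    """Ridit of a category = proportion in lower categories + half its own proportion."""
    counts = Counter(values)
    n = len(values)
    ridits, below = {}, 0.0
    for category in ordered_categories:
        p = counts[category] / n
        ridits[category] = below + p / 2
        below += p
    return ridits


def rescale(scores, new_min=0.0, new_max=1.0):
    """Linearly rescale scores so they run from new_min to new_max."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scores have no spread, so they cannot be rescaled")
    return [new_min + (x - lo) * (new_max - new_min) / (hi - lo) for x in scores]


def equivalised_income(income, adults, children_under_14):
    """Divide household income by its OECD-modified equivalence weight."""
    if adults < 1:
        raise ValueError("a household needs at least one adult")
    weight = 1.0 + 0.5 * (adults - 1) + 0.3 * children_under_14
    return income / weight


# Hypothetical data for illustration.
responses = ["low", "low", "mid", "high"]
ridits = ridit_scores(responses, ["low", "mid", "high"])
rescaled = rescale([10, 20, 30])
income = equivalised_income(42000, adults=2, children_under_14=2)
print(ridits, rescaled, income)
```

Ridit scores and min-max rescaling both depend on the distribution or range observed in the data, so the reference sample or the minimum and maximum should be chosen deliberately and reported.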
Natural language processing (NLP) is a branch of computer science and linguistics. It describes the process of computers understanding and analysing text and spoken words, which is enabled through machine learning and artificial intelligence.
NLP has been employed by the Harmony project to harmonise mental health-related questionnaires. The NLP in Harmony is applied to questionnaire items (questions) and determines which items are more semantically similar and which are more different, matches similar items, and assigns the match a score.
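The idea of matching items and scoring each match can be illustrated with a much simpler stand-in. The sketch below does not reproduce Harmony's language model: it swaps in plain word-count vectors and cosine similarity, and the questionnaire items are invented for the example.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Turn an item into a bag-of-words count vector (a crude stand-in for a language model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors, from 0 (no overlap) to 1 (identical)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_matches(items_a, items_b):
    """For each item in questionnaire A, find the most similar item in questionnaire B."""
    matches = []
    for a in items_a:
        score, b = max((cosine(vectorise(a), vectorise(b)), b) for b in items_b)
        matches.append((a, b, score))
    return matches


# Invented items for illustration only.
questionnaire_a = ["I often feel nervous", "I have trouble sleeping"]
questionnaire_b = ["Trouble falling asleep or sleeping too much", "Feeling nervous or on edge"]

matches = best_matches(questionnaire_a, questionnaire_b)
for a, b, score in matches:
    print(f"{a!r} -> {b!r} (similarity {score:.2f})")
```

Word overlap cannot recognise that "anxious" and "nervous" are related, which is one reason tools of this kind rely on models that capture meaning rather than exact wording.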
Read more about how Harmony works in this blog https://harmonydata.ac.uk/how-does-harmony-work/
There are multiple statistical techniques which involve modelling a selection of items or variables as representing the same underlying construct.
For example, a questionnaire to assess depression has multiple questions asking about different feelings and behaviours that are often connected to being depressed. Answers to these questions give us an idea of the underlying depression construct, even though we can’t see or measure depression itself.
Some of these statistical methods include:
- Structural equation modelling
- Factor analysis e.g. confirmatory factor analysis, moderated nonlinear factor analysis
- Item response theory
These latent variable modelling techniques normally allow for the testing of measurement invariance – a formal statistical test for the equivalence of measurement between the different items or variables.
Read more about measurement invariance in the Methodological guidance section [link to section]
For an example of using confirmatory factor analysis to determine the measurement equivalence of measures from different studies, see the harmonisation report from the CLOSER harmonised mental health measures dataset:
Harmonisation and measurement properties of mental health measures in six British cohorts https://www.closer.ac.uk/wp-content/uploads/210715-Harmonisation-measurement-properties-mental-health-measures-british-cohorts.pdf
There is a selection of statistical methods used to harmonise data by converting scores on one scale into equivalent scores on another scale. They use a particular algorithm or set of rules to carry out this transformation. In this way they are similar to manual equivalent scaling or categorisation, but they use statistical rules to decide which values or categories are equivalent.
For example, if an individual scores 10 on one scale of depression (e.g. the Malaise-24), the algorithm determines that the equivalent score on the other scale (e.g. the GHQ-12) is 15.
These techniques are useful when different measurement tools have been used in different studies but are measuring the same underlying construct.
Some of these statistical methods include:
- Equipercentile linking
- Calibrated cut-off
- Multiple imputation
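Of the methods listed above, equipercentile linking is the simplest to sketch: a score on one scale is mapped to the score on the other scale that has the same percentile rank. The version below is a bare-bones illustration with made-up samples and hypothetical function names. It picks the nearest observed score and leaves out the smoothing and interpolation that a full equating analysis would normally use.

```python
def percentile_rank(score, sample):
    """Mid-rank percentile: share of the sample below the score plus half the share equal to it."""
    below = sum(1 for s in sample if s < score)
    equal = sum(1 for s in sample if s == score)
    return 100.0 * (below + 0.5 * equal) / len(sample)


def equipercentile_link(score_a, sample_a, sample_b):
    """Return the observed scale-B score whose percentile rank is closest to that of score_a."""
    target = percentile_rank(score_a, sample_a)
    candidates = sorted(set(sample_b))
    return min(candidates, key=lambda b: abs(percentile_rank(b, sample_b) - target))


# Made-up samples: scale A runs from 0 to 9, scale B from 10 to 28.
sample_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sample_b = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

linked = equipercentile_link(4, sample_a, sample_b)
print(linked)
```

The linked score is only as trustworthy as the samples behind it, so this approach assumes the two samples are comparable and that both scales measure the same underlying construct.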
See this paper for a comparison of these techniques to harmonise psychological distress:
PAPER: Psychological Distress Across Adulthood: Equating Scales in Three British Birth Cohorts by Jongsma et al. (2022)
Read more about different methods for harmonising data in this paper:
PAPER: Statistical approaches to harmonize data on cognitive measures in systematic reviews are rarely reported by Griffith et al. (2015)
This section is part of the CLOSER Training Hub.