Harmonisation methods

There are different methods which can be used to harmonise data between studies. This section gives a brief introduction to some of the methods to make data more comparable.

The methods highlighted in this section mainly refer to retrospective harmonisation, where existing data is processed to be more comparable.

Manually choosing similar variables or individuals

This approach involves finding similar variables in the different datasets and cleaning them in a consistent way so that they can be compared.

This method might also involve reducing the sample in the studies to those individuals who would be more comparable on the variables of interest.

For example, this PLOS Medicine paper by Johnson, Li, Kuh and Hardy (2015) used data from the CLOSER harmonised dataset on body composition and excluded groups from the analytical sample who were likely to have particularly different body mass index (BMI) values.

This allowed the authors to conduct comparative analyses and to see trends in body mass over time.

PLOS Medicine article: How Has the Age-Related Process of Overweight or Obesity Development Changed over Time? (Johnson, Li, Kuh and Hardy, 2015)

Equivalent scaling or categorisation

This approach involves putting the variables of interest on the same scale or using the same categories. This means that similar variables that were measured slightly differently are now comparable. A limitation of this method is that it can lead to less informative scales being used, as the scale or categories that are chosen are normally the lowest common denominator and so might not have the granularity of the more informative scales (Bann et al. 2022).

An example of equivalent categorisation for marital status is displayed below. Note how the final categorisation has to combine multiple categories from Study 1 to match Study 2 and how this affects the granularity of the data.

Study	Marital status	Marital status – equivalent categorisation
Study 1	1 = Single; 2 = Single with a partner; 3 = Married; 4 = Divorced; 5 = Separated; 6 = Widowed	1 = Single / Single with a partner; 2 = Married; 3 = Divorced / Separated; 4 = Widowed
Study 2	1 = Single; 2 = Married; 3 = Divorced or separated; 4 = Widowed	1 = Single / Single with a partner; 2 = Married; 3 = Divorced / Separated; 4 = Widowed
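
The recoding in the table can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the dictionary and function names are hypothetical, and the codes are taken directly from the table above.

```python
# Hypothetical mapping from Study 1's six codes to the four shared categories.
STUDY1_TO_HARMONISED = {
    1: 1,  # Single                -> Single / Single with a partner
    2: 1,  # Single with a partner -> Single / Single with a partner
    3: 2,  # Married               -> Married
    4: 3,  # Divorced              -> Divorced / Separated
    5: 3,  # Separated             -> Divorced / Separated
    6: 4,  # Widowed               -> Widowed
}

# Study 2 already uses four categories, so each code maps to itself.
STUDY2_TO_HARMONISED = {1: 1, 2: 2, 3: 3, 4: 4}


def harmonise(codes, mapping):
    """Recode a list of marital status codes; unrecognised codes become None."""
    return [mapping.get(code) for code in codes]


study1_harmonised = harmonise([1, 2, 3, 4, 5, 6], STUDY1_TO_HARMONISED)
study2_harmonised = harmonise([1, 2, 3, 4], STUDY2_TO_HARMONISED)
print(study1_harmonised)
print(study2_harmonised)
```

Note that the recoding cannot be reversed: once "Single" and "Single with a partner" share a code, the distinction recorded in Study 1 is no longer available in the harmonised variable.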

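There are different ways in which equivalent categories or scaling can be achieved, including coding systems, meaningful cut-offs and standardisation. The table below shows an example of standardisation, where a parental care variable measured on different scales in two studies is converted to a common standardised scale: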

Study	Parental care variable – Unstandardised	Parental care variable – Standardised
1946 National Survey of Health and Development (NSHD)	Ranged from 0 (not caring) to 33 (very caring)	Ranged from -3.73 to 1.50
1970 British Cohort Study (BCS70)	Ranged from 0 (not caring) to 7 (very caring)	Ranged from -2.65 to 1.15
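
One common way to standardise a variable is to convert it to z-scores within each study, so that each study's scores have a mean of 0 and a standard deviation of 1. The sketch below assumes this is the kind of standardisation shown in the table; the raw scores are invented for illustration and are not real cohort data, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardise(scores):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD; the sample SD is an equally valid choice
    if sd == 0:
        raise ValueError("All scores are identical, so they cannot be standardised.")
    return [(score - m) / sd for score in scores]


# Invented raw scores, for illustration only (not real cohort data).
nshd_scores = [0, 12, 20, 27, 33]  # measured on a 0 to 33 scale
bcs70_scores = [0, 3, 5, 6, 7]     # measured on a 0 to 7 scale

# Standardise within each study separately.
nshd_z = standardise(nshd_scores)
bcs70_z = standardise(bcs70_scores)
print([round(z, 2) for z in nshd_z])
print([round(z, 2) for z in bcs70_z])
```

After standardisation, a score describes how far an individual sits from their own study's average, which is why the two variables can be compared even though the original ranges differ.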

Identify similar variables with Natural Language Processing

Natural language processing (NLP) is a branch of computer science and linguistics. It describes the process of computers understanding and analysing text and spoken words, which is enabled through machine learning and artificial intelligence.

NLP has been employed by the Harmony project to harmonise mental health-related questionnaires. In Harmony, NLP is applied to questionnaire items (questions) to determine how semantically similar each pair of items is. Similar items are then matched, and each match is assigned a score.
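
The general idea of scoring and matching items can be sketched as follows. This is not Harmony's actual model: as a stand-in, it uses a simple word-count cosine similarity, which only detects shared words rather than shared meaning. The questionnaire items and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Score two texts from 0 (no shared words) to 1 (identical word counts)."""
    counts_a = Counter(tokenise(text_a))
    counts_b = Counter(tokenise(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(items_a, items_b):
    """For each item in items_a, find the highest-scoring item in items_b."""
    matches = []
    for item_a in items_a:
        scored = [(cosine_similarity(item_a, item_b), item_b) for item_b in items_b]
        score, item_b = max(scored, key=lambda pair: pair[0])
        matches.append((item_a, item_b, score))
    return matches


# Invented questionnaire items, for illustration only.
questionnaire_a = ["I feel nervous or anxious", "I have trouble falling asleep"]
questionnaire_b = ["I have trouble sleeping at night", "I often feel anxious"]

matches = best_matches(questionnaire_a, questionnaire_b)
for item_a, item_b, score in matches:
    print(f"{item_a!r} <-> {item_b!r} (score {score:.2f})")
```

A word-count approach like this treats "asleep" and "sleeping" as unrelated words, which is the kind of limitation that a model of semantic similarity is intended to overcome.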

Harmony Blog

How does Harmony work?

Use latent variable approaches to confirm two variables are similar

There are multiple statistical techniques which involve modelling a selection of items or variables as representing the same underlying construct.

For example, a questionnaire to assess depression has multiple questions asking about different feelings and behaviours that are often connected to being depressed. Answers to these questions give us an idea of the underlying depression construct, even though we can’t see or measure depression itself.

Some of these statistical methods include:

Structural equation modelling
Factor analysis, e.g. confirmatory factor analysis or moderated nonlinear factor analysis
Item response theory

These latent variable modelling techniques normally allow for the testing of measurement invariance – a formal statistical test for the equivalence of measurement between the different items or variables.

Read more about measurement invariance in the Methodological guidance section.

Use a statistical algorithm to convert one variable into another

There is a selection of statistical methods used to harmonise data by converting scores on one scale into equivalent scores on another scale. They use a particular algorithm or set of rules to carry out this transformation. In this way they are similar to manual equivalent scaling or categorisation, but they use statistical rules to decide which values or categories are equivalent.

For example, if an individual scores 10 on one scale of depression (e.g. the Malaise-24), the algorithm determines that the equivalent score on the other scale (e.g. the GHQ-12) is 15.

These techniques are useful when different measurement tools have been used in different studies but are measuring the same underlying construct.

Some of these statistical methods include:

Equipercentile linking
Calibrated cut-off
Multiple imputation
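
As an illustration of the first method in the list, the sketch below shows a deliberately simplified form of equipercentile linking: a score on one scale is converted to the score on the other scale that has the closest percentile rank. Published applications typically add interpolation and smoothing, which are omitted here. The data and function names are invented for illustration and do not come from any real scale.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sample, score):
    """Percentage of the sample below the score, plus half of those exactly at it."""
    ordered = sorted(sample)
    below = bisect_left(ordered, score)
    at = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * at) / len(ordered)


def equipercentile_link(score, sample_from, sample_to):
    """Return the observed score in sample_to whose percentile rank is
    closest to the percentile rank of `score` in sample_from."""
    target = percentile_rank(sample_from, score)
    candidates = sorted(set(sample_to))
    return min(candidates, key=lambda s: abs(percentile_rank(sample_to, s) - target))


# Invented scores on two hypothetical scales measuring the same construct.
scale_a_sample = [1, 2, 2, 3, 5]
scale_b_sample = [2, 4, 4, 6, 10]

linked = equipercentile_link(3, scale_a_sample, scale_b_sample)
print(linked)
```

Because the conversion depends entirely on the two score distributions, the samples used for linking should be drawn from comparable populations.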

Read more about these and other harmonisation techniques in the articles below:

Psychological Distress Across Adulthood: Equating Scales in Three British Birth Cohorts

Statistical approaches to harmonize data on cognitive measures in systematic reviews are rarely reported

