Closer - The home of longitudinal research

Anonymisation and pseudonymisation

Anonymisation or pseudonymisation is a crucial step to protect study participants’ privacy and confidentiality. If you are using secondary data, anonymisation or pseudonymisation will usually have been carried out by the team who collected and/or shared the data.

However, it is important to understand these processes, including how and why they have been carried out for the data you are using. This may also explain some quirks that you see in your data (see statistical disclosure control below).

How can individuals potentially be identified?

Using data from a research study without carrying out anonymisation or pseudonymisation could make an individual directly identifiable from their name, address, email address, phone number or other unique personal characteristic.

Individual participants may also be indirectly identifiable when certain information from the study is linked to other sources of information, such as their place of work, postcode, or diagnosis of a medical condition.

A third way that individual participants can be identified is if they are known to the researcher and their characteristics are distinctive enough that they can be pinpointed within a dataset. For example, this could be through the intersection of their ethnicity, occupation, age, and gender.
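As a rough illustration of this kind of check, the sketch below counts how many records share each combination of characteristics and lists the combinations held by only one record. The records, field names, and the choice of fields are invented for the example and are not taken from any study.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"ethnicity": "A", "occupation": "teacher", "age_band": "30-39", "gender": "F"},
    {"ethnicity": "A", "occupation": "teacher", "age_band": "30-39", "gender": "F"},
    {"ethnicity": "B", "occupation": "pilot", "age_band": "50-59", "gender": "M"},
]

# Assumed set of characteristics that could identify someone in combination.
QUASI_IDENTIFIERS = ("ethnicity", "occupation", "age_band", "gender")


def combination_counts(rows, keys):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_combinations(rows, keys):
    """Return the combinations that are held by exactly one row."""
    counts = combination_counts(rows, keys)
    return [combo for combo, n in counts.items() if n == 1]


risky = unique_combinations(records, QUASI_IDENTIFIERS)
print(risky)  # [('B', 'pilot', '50-59', 'M')]
```

Any combination returned here describes a single participant, who could be recognised by someone who already knows those characteristics.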

Small cell counts within datasets pose a high risk of identifying individuals through indirect methods. Typically, study teams will eliminate small cell counts by combining scores and variables; find out more in the statistical disclosure control section below.
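A minimal sketch of finding small cells and merging them into a broader category is shown below. The threshold of 10, the occupation values, and the "other" category are assumptions made for the example; each study team sets its own rules.

```python
from collections import Counter

THRESHOLD = 10  # assumed minimum cell count; real studies set their own rule


def small_cells(values, threshold=THRESHOLD):
    """Return the categories whose count falls below the threshold."""
    counts = Counter(values)
    return {category: n for category, n in counts.items() if n < threshold}


def merge_categories(values, mapping):
    """Recode values using mapping; categories not in mapping are unchanged."""
    return [mapping.get(v, v) for v in values]


# Hypothetical data: two occupations are rare enough to be risky.
occupations = ["teacher"] * 40 + ["nurse"] * 25 + ["pilot"] * 6 + ["diver"] * 5

print(small_cells(occupations))  # {'pilot': 6, 'diver': 5}

# Combine the rare occupations into one broader category.
recoded = merge_categories(occupations, {"pilot": "other", "diver": "other"})
print(small_cells(recoded))  # {}
```

After merging, the combined category holds 11 records, so no cell falls below the assumed threshold. The cost is that pilots and divers can no longer be analysed separately.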

Striking a balance

There is a balance to consider when anonymising research data, as true anonymisation can sometimes remove so much information from the data that it is no longer useful for research.

For example, an anonymised dataset might contain only a few variables, such as educational aspirations and weight in 10 kg categories. In a dataset of thousands of people with similar attitudes and weights, it would be practically impossible to identify an individual from their attitude towards education and weight category. However, such a dataset would also limit the research questions that could be investigated.
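The banding described in this example can be sketched as follows. The function name and label format are hypothetical; only the 10 kg width comes from the example above.

```python
def weight_band(weight_kg, width=10):
    """Replace an exact weight with the band of the given width that contains it."""
    if weight_kg < 0:
        raise ValueError("weight cannot be negative")
    lower = int(weight_kg // width) * width
    return f"{lower} to <{lower + width} kg"


# Hypothetical exact weights; only the band would be released.
for weight in (58.2, 72.4, 79.9, 80.0):
    print(weight, "->", weight_band(weight))
# 58.2 -> 50 to <60 kg
# 72.4 -> 70 to <80 kg
# 79.9 -> 70 to <80 kg
# 80.0 -> 80 to <90 kg
```

Wider bands place more people in each category and so lower the risk of identification, but they also discard the detail that some analyses need.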

Pseudonymisation is therefore often the most appropriate method for research data, as it preserves fine-grained categories for analysis and rich information about study participants.
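In pseudonymisation, direct identifiers are replaced with a code, and the key linking codes to people is held separately from the research data. The sketch below shows one simple way this could be done, using random participant IDs. The field names, the ID format, and the function itself are assumptions for illustration, not the method of any particular study.

```python
import secrets

# Assumed direct identifiers to strip from the research copy of the data.
DIRECT_IDENTIFIERS = ("name", "email")


def pseudonymise(rows, id_field="name", drop=DIRECT_IDENTIFIERS):
    """Return (research rows without direct identifiers, key table).

    The key table maps each original identifier to its random pseudonym and
    would need to be stored separately and securely from the research data.
    """
    key_table = {}
    output = []
    for row in rows:
        person = row[id_field]
        if person not in key_table:
            # Random ID, so the pseudonym cannot be derived from the name.
            key_table[person] = "P" + secrets.token_hex(8)
        cleaned = {k: v for k, v in row.items() if k not in drop}
        cleaned["participant_id"] = key_table[person]
        output.append(cleaned)
    return output, key_table


# Hypothetical records from two survey waves for the same people.
raw = [
    {"name": "Alex Example", "email": "alex@example.org", "wave": 1, "weight_kg": 72.4},
    {"name": "Alex Example", "email": "alex@example.org", "wave": 2, "weight_kg": 73.1},
    {"name": "Sam Sample", "email": "sam@example.org", "wave": 1, "weight_kg": 58.2},
]

research_data, key = pseudonymise(raw)
```

The same person receives the same `participant_id` in every wave, so their records can still be linked over time, while the detailed values such as exact weight are kept for analysis.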

Learn more about anonymisation, pseudonymisation, and GDPR regulation for personal data from the links below:

Statistical disclosure control

Even after data have been anonymised, disclosive information may remain.

Statistical disclosure control refers to methods used to protect the anonymity and confidentiality of data. For example, the UK Office for National Statistics (ONS) carried out statistical disclosure control on the 2021 Census data for England and Wales to ensure there was no identifying information in outputs and data requested by users.