Anonymisation and pseudonymisation
Anonymisation or pseudonymisation is a crucial step to protect study participants’ privacy and confidentiality. If you are using secondary data, anonymisation or pseudonymisation will usually have been carried out by the team who collected and/or shared the data.
However, it is important to understand these processes, including how and why they were applied to your data. They may also explain some quirks that you see in your data (see statistical disclosure control below).
How can individuals potentially be identified?
Using data from a research study without carrying out anonymisation or pseudonymisation could make an individual directly identifiable from their name, address, email address, phone number or other unique personal characteristic.
Individual participants may also be indirectly identifiable when certain information from the study, such as their place of work, postcode, or diagnosis of a medical condition, is linked to other sources of information.
A third way that individual participants can be identified is if they are known to the researcher, and their characteristics are so unique that they can be pinpointed within a dataset. For example, this could be through the intersection of their ethnicity, occupation, age, and gender.
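A rough way to spot this kind of risk is to count how many records share each combination of characteristics. The sketch below uses made-up records and a hypothetical helper, unique_combinations. It is an illustration only, not a complete disclosure check.

```python
from collections import Counter

# Made-up records: (ethnicity, occupation, age band, gender).
records = [
    ("White British", "Teacher", "30-39", "F"),
    ("White British", "Teacher", "30-39", "F"),
    ("White British", "Nurse", "40-49", "F"),
    ("White British", "Nurse", "40-49", "F"),
    ("Black Caribbean", "Pilot", "60-69", "M"),
]

def unique_combinations(rows):
    """Return the combinations of characteristics held by exactly one record."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]

# Any combination listed here describes a single participant in the dataset.
unique = unique_combinations(records)
print(unique)
```

A combination that appears only once does not prove that the person can be identified, but it flags a record that deserves a closer look.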
Small cell counts within datasets pose a high risk of identifying individuals through indirect methods. Typically, study teams will eliminate small cell counts by combining scores and variables; find out more in the statistical disclosure control section below.
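As a minimal sketch of combining categories to remove small cells, the example below merges every category with fewer than a chosen number of records into one combined label. The function name merge_small_cells, the "Other" label, and the threshold of 10 are assumptions for illustration; study teams set their own rules.

```python
from collections import Counter

def merge_small_cells(values, threshold=10, merged_label="Other"):
    """Replace categories with fewer than `threshold` records by one combined label."""
    counts = Counter(values)
    small = {category for category, n in counts.items() if n < threshold}
    return [merged_label if value in small else value for value in values]

# Made-up occupation data with two very small categories.
occupations = ["Teacher"] * 40 + ["Nurse"] * 25 + ["Pilot"] * 2 + ["Diver"] * 1
merged = merge_small_cells(occupations, threshold=10)
merged_counts = dict(Counter(merged))
print(merged_counts)
```

Note that the combined category can itself be small (here it holds only three records), so the check usually needs repeating after merging.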
Anonymisation refers to the process of removing any personally identifying data, both direct and indirect, so that individuals cannot be identified. Truly anonymised data are no longer within the scope of the EU General Data Protection Regulation (GDPR) because the data are no longer ‘personal’, i.e. linked to an identifiable person.
Pseudonymisation refers to the process of removing personal data from a research study in such a way that the data can no longer be attributed to a specific individual without the use of additional information. That additional information is protected and stored separately.
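The idea of keeping the additional information separate can be sketched as follows. This is a simplified illustration, assuming a single direct identifier held in a field called name. The function pseudonymise and the field participant_id are hypothetical names, and a real study would also need to secure the key table and deal with indirect identifiers.

```python
import secrets

def pseudonymise(records, id_field="name"):
    """Split records into pseudonymised rows and a separate re-identification key."""
    key_table = {}       # pseudonym -> original identifier; store separately and securely
    pseudonymised = []
    for record in records:
        pseudonym = secrets.token_hex(8)   # random, so it reveals nothing about the person
        while pseudonym in key_table:      # regenerate in the unlikely event of a clash
            pseudonym = secrets.token_hex(8)
        key_table[pseudonym] = record[id_field]
        # Copy every field except the direct identifier, then attach the pseudonym.
        row = {k: v for k, v in record.items() if k != id_field}
        row["participant_id"] = pseudonym
        pseudonymised.append(row)
    return pseudonymised, key_table

# Made-up participants.
participants = [
    {"name": "Alex Example", "age": 34, "score": 12},
    {"name": "Sam Sample", "age": 58, "score": 7},
]
research_data, key_table = pseudonymise(participants)
print(research_data)
```

Only research_data would be shared for analysis. Without key_table, the participant IDs cannot be traced back to names.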
Striking a balance
There is a balance to consider when anonymising research data, as true anonymisation can sometimes remove so much information from the data that they are no longer useful for research.
For example, an anonymised dataset might contain only a few variables, such as educational aspirations and weight categories of 10 kg each. It would be impossible to identify an individual based on their attitude towards education and weight category, out of a dataset of thousands of people with similar attitudes and weights. However, this dataset would also limit the possible research questions that could be investigated.
Pseudonymisation is therefore often the most appropriate method for research data, to preserve fine-grained categories for analysis and rich information about study participants.
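To make the weight example concrete, here is a small sketch of how exact weights might be grouped into 10 kg categories. The function name weight_category and the label format are assumptions for illustration.

```python
def weight_category(weight_kg, width=10):
    """Map an exact weight to a band of `width` kg, e.g. 72.4 -> '70 to <80 kg'."""
    lower = int(weight_kg // width) * width
    return f"{lower} to <{lower + width} kg"

# Made-up weights in kilograms.
weights = [59.9, 72.4, 78.0, 80.0]
categories = [weight_category(w) for w in weights]
print(categories)
```

The exact weights 72.4 and 78.0 end up in the same category, which is what protects individuals and also what limits the detail available for analysis.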
Statistical disclosure control
Even after data have been anonymised, disclosive information may remain.
Disclosive information refers to information that has the potential to reveal sensitive or personally identifiable details about individuals. The more obvious examples of disclosive information include names, addresses, National Insurance numbers, email addresses, and medical records.
However, other potentially disclosive information is less obvious. For example, the age and/or year of birth of study participants aged 90 or over is often summarised into a “90+” category, because only a small number of individuals are that old. Combined with the detailed information contained in the rest of the survey, an exact age might make it possible to identify an individual.
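Grouping the oldest ages into a single category is sometimes called top-coding. A minimal sketch, assuming a cut-off of 90 and a hypothetical function name top_code_age:

```python
def top_code_age(age, cap=90):
    """Return the age as text, replacing any age at or above `cap` with '<cap>+'."""
    return f"{cap}+" if age >= cap else str(age)

# Made-up ages.
ages = [34, 67, 90, 93, 101]
top_coded = [top_code_age(a) for a in ages]
print(top_coded)
```

After top-coding, the three oldest participants can no longer be told apart by age alone.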
Statistical disclosure control refers to methods used to protect the anonymity and confidentiality of data. For example, the UK Office for National Statistics (ONS) carried out statistical disclosure control on the Census 2021 data for England and Wales to ensure there was no identifying information in outputs and data requested by users.