Processing your data
Once you have identified the appropriate storage for your data, the data may need to be processed before they are ready for analysis or for deposit for others to use.
Some processing may still be required when you are using secondary data for your analysis, even though many of these steps will likely have been carried out by the study team who collected and deposited the data.
Best practice when using secondary data is to make a backup of the raw data files (e.g., the files you downloaded from the study) before you start processing, in case the data are accidentally overwritten. To reduce that risk, save any changes you make as a new, separate dataset and do not write over the original dataset.
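The backup advice above can be sketched in a few lines of Python. This is only an illustration using the standard library: the file name `study_wave1.csv`, the `raw_backup` folder, and the `_processed` suffix are made-up examples, not conventions of any particular study.

```python
import shutil
import tempfile
from pathlib import Path


def back_up_raw_file(raw_path: Path, backup_dir: Path) -> Path:
    """Copy the raw file into backup_dir, refusing to replace an existing backup."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup_path = backup_dir / raw_path.name
    if backup_path.exists():
        raise FileExistsError(f"Backup already exists: {backup_path}")
    shutil.copy2(raw_path, backup_path)  # copy2 also preserves file timestamps
    return backup_path


def processed_path_for(raw_path: Path, suffix: str = "_processed") -> Path:
    """Build a new file name beside the raw file so the original is not overwritten."""
    return raw_path.with_name(f"{raw_path.stem}{suffix}{raw_path.suffix}")


# Demonstration in a temporary directory with a made-up raw file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "study_wave1.csv"
    raw.write_text("id,age\n1,34\n2,51\n")

    backup = back_up_raw_file(raw, Path(tmp) / "raw_backup")
    out = processed_path_for(raw)
    out.write_text(raw.read_text())  # placeholder for real processing

    demo_backup_name = backup.name
    demo_out_name = out.name
    demo_raw_unchanged = raw.read_text() == backup.read_text()
```

Writing the processed data to a differently named file means a mistake in the processing code cannot damage the downloaded original.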
Data processing steps that may be required:
Data cleaning involves identifying and handling outliers, dealing with duplicate entries and removing irrelevant or unnecessary data.
It also includes checking for data integrity issues, such as ensuring that data are entered correctly and consistently.
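As a rough illustration of these checks, the sketch below removes exact duplicate rows, standardises inconsistently entered text, and flags implausible values for review. The records, the field names, and the plausible age range of 0 to 120 are all assumptions made for the example. A simple range check stands in here for more formal outlier detection.

```python
# Made-up records with typical data-entry problems.
records = [
    {"id": "001", "sex": "Female", "age": 34},
    {"id": "002", "sex": "male ", "age": 51},
    {"id": "002", "sex": "male ", "age": 51},    # exact duplicate entry
    {"id": "003", "sex": "FEMALE", "age": 430},  # implausible, likely a typo
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def standardise_text(rows, field):
    """Trim whitespace and lower-case a text field so categories are consistent."""
    return [{**row, field: row[field].strip().lower()} for row in rows]


def flag_out_of_range(rows, field, low, high):
    """Return rows whose value falls outside the plausible range [low, high]."""
    return [row for row in rows if not (low <= row[field] <= high)]


cleaned = standardise_text(drop_exact_duplicates(records), "sex")
to_review = flag_out_of_range(cleaned, "age", low=0, high=120)
```

Flagged rows are returned for review rather than deleted, because an unusual value may be genuine and the decision to correct or remove it should be recorded.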
Data integration is important if you are working with multiple datasets or files of data. It involves combining and merging data to create a unified dataset.
This step requires resolving discrepancies, standardising/harmonising variables and establishing common identifiers to link related data.
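A minimal sketch of linking two files on a common identifier is shown below. The two "waves", the field names `pid` and `PersonID`, and the `_merge` marker are hypothetical, and the sketch assumes each identifier appears at most once per file.

```python
# Two made-up files that name and format the participant identifier differently.
wave1 = [{"pid": "A01", "sex": "F"}, {"pid": "A02", "sex": "M"}]
wave2 = [
    {"PersonID": "a01 ", "income_gbp": 25000},
    {"PersonID": "A03", "income_gbp": 31000},
]


def harmonise_id(rows, source_field, target_field="pid"):
    """Rename the identifier field and standardise its formatting."""
    harmonised = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != source_field}
        new_row[target_field] = row[source_field].strip().upper()
        harmonised.append(new_row)
    return harmonised


def merge_on_id(left, right, key="pid"):
    """Keep every identifier from both files and record where each row came from.

    Assumes identifiers are unique within each file.
    """
    right_by_id = {row[key]: row for row in right}
    left_ids = set()
    merged = []
    for row in left:
        left_ids.add(row[key])
        match = right_by_id.get(row[key])
        source = "both" if match else "left_only"
        merged.append({**row, **(match or {}), "_merge": source})
    for row in right:
        if row[key] not in left_ids:
            merged.append({**row, "_merge": "right_only"})
    return merged


combined = merge_on_id(harmonise_id(wave1, "pid"), harmonise_id(wave2, "PersonID"))
```

Recording the source of each row makes unmatched participants visible, which helps when resolving discrepancies between files.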
There may be some key derived variables that need to be created in the dataset prior to analysis or deposit. For example, many of the studies that are part of CLOSER release derived variables alongside the raw data, such as variables on wealth and income, BMI and physical activity.
These variables are often derived using multiple questions from the survey, data from previous waves of data collection, or are a higher-level summary than the raw data (e.g., geographic area or ethnic group).
Deriving these before depositing the data ensures that all data users work with the same, correctly derived variable. The derivation process should always be recorded in the dataset documentation and/or the accompanying code.
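To illustrate a documented derivation, the sketch below computes BMI using the standard formula (weight in kilograms divided by height in metres squared). The field names, the rounding to one decimal place, and the handling of missing values are assumptions for the example, not the method used by any particular study.

```python
def derive_bmi(weight_kg, height_cm):
    """Derive BMI as weight (kg) / height (m) squared, rounded to 1 decimal place.

    Returns None when either input is missing or not positive, so that
    missing raw data stay missing in the derived variable.
    """
    if weight_kg is None or height_cm is None:
        return None
    if weight_kg <= 0 or height_cm <= 0:
        return None
    height_m = height_cm / 100
    return round(weight_kg / height_m**2, 1)


# Made-up participants; the second has a missing weight.
participants = [
    {"pid": "A01", "weight_kg": 70.0, "height_cm": 175.0},
    {"pid": "A02", "weight_kg": None, "height_cm": 160.0},
]

for person in participants:
    person["bmi_derived"] = derive_bmi(person["weight_kg"], person["height_cm"])

# A short derivation note that could be kept with the dataset documentation.
derivation_note = {
    "variable": "bmi_derived",
    "inputs": ["weight_kg", "height_cm"],
    "formula": "weight_kg / (height_cm / 100) ** 2, rounded to 1 d.p.",
    "missing_rule": "None if either input is missing or not positive",
}
```

Keeping the formula and the missing-data rule alongside the code lets other users check exactly how the variable was produced.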
A key part of processing data before depositing them for use by others is ensuring that disclosive material is removed so that individuals cannot be identified (or at least cannot be identified without the use of additional information).
Even after data have been pseudonymised or anonymised, disclosive information may remain.
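One way residual disclosure risk can arise is through rare combinations of otherwise harmless variables. The sketch below drops direct identifiers and then lists combinations of indirect identifiers that occur fewer than a chosen number of times. The column names, the example rows, and the threshold of 2 are purely illustrative assumptions. Actual thresholds and rules are set by the data provider or repository.

```python
from collections import Counter

# Hypothetical column names treated as direct identifiers in this example.
DIRECT_IDENTIFIERS = {"name", "postcode"}

rows = [
    {"pid": "A01", "name": "Person A", "postcode": "ZZ1 1AA",
     "age_band": "30-39", "occupation": "teacher"},
    {"pid": "A02", "name": "Person B", "postcode": "ZZ1 1AB",
     "age_band": "30-39", "occupation": "teacher"},
    {"pid": "A03", "name": "Person C", "postcode": "ZZ2 2AA",
     "age_band": "80-89", "occupation": "astronaut"},
]


def drop_fields(records, fields):
    """Return copies of the records without the listed fields."""
    return [{k: v for k, v in rec.items() if k not in fields} for rec in records]


def rare_combinations(records, fields, min_count=2):
    """Count each combination of values and return those seen fewer than min_count times."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n < min_count}


deidentified = drop_fields(rows, DIRECT_IDENTIFIERS)
risky = rare_combinations(deidentified, ["age_band", "occupation"])
```

In this made-up example the names and postcodes are gone, yet one participant is still the only person with their combination of age band and occupation, so that record would need further treatment before deposit.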
Read more about removing disclosive material in the next section on Anonymisation and pseudonymisation.