Understanding your data
Once you have access to your data, look through both the documentation and the data themselves so that you fully understand their structure, limitations, potential biases and any other features that may affect your analysis.
Some key considerations that are specifically relevant to longitudinal population studies are outlined below.
It is important to determine how missing data have been treated in your dataset before you run descriptive statistics or more detailed analyses.
Statistical programs such as SPSS, Stata and R each have their own way of tagging missing data (termed “system missing”) so that the program recognises the values as missing and, by default, excludes them from analysis.
System missing values in Stata and SPSS are denoted by a full stop “.” and in R by “NA”. In Mplus scripts, users must declare which value in their dataset represents missing; this is a user defined value (see below), e.g., -999.
User defined missing
User defined missing values are values of a variable that the user has assigned to mean missing. This allows the data to distinguish different types of missing, such as “Refusal”, “Not applicable”, or “No answer”.
Any value can be given a label to show that it is missing, but it is important to check the values that have been assigned to a missing category, as they are often values that the statistical program will include in analyses.
For example, it is common in longitudinal population studies for categories of missing data to be assigned negative numbers (e.g., -9 = No response; -8 = Refusal) or large numbers (e.g., 99 = Not applicable). If the statistical program you are using has not been told to treat these values as missing, it will include them as valid values in any statistics and analyses you run, which will give erroneous results.
How to deal with missing data
If the different categories of missing are not important for your analysis, you may want to recode them all as system missing so they are ignored by your program and your results are only based on non-missing values.
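As a rough illustration of this recoding, the Python sketch below uses pandas to turn user defined missing codes into NaN (the pandas equivalent of system missing). The variable name, the example values and the codes -9, -8 and 99 are assumptions made for this example only; check your study's documentation for the codes actually used.

```python
import pandas as pd

# Hypothetical variable: weekly hours worked, containing user defined missing codes.
hours = pd.Series([35, 40, -9, 20, -8, 99, 45], name="hours_worked")

# Assumed codes for this illustration: -9 = No response, -8 = Refusal, 99 = Not applicable.
MISSING_CODES = [-9, -8, 99]

# Without recoding, the missing codes are treated as real values and distort the mean.
naive_mean = hours.mean()

# Recode every user defined missing code to NaN so it is skipped by default.
hours_clean = hours.mask(hours.isin(MISSING_CODES))
clean_mean = hours_clean.mean()

print(f"Mean including missing codes: {naive_mean:.2f}")
print(f"Mean after recoding to NaN:   {clean_mean:.2f}")
```

Comparing the two means is a quick check that the recoding has removed the missing codes from the calculation.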
Another option is to use extended missing values if these are available in your program. For example, Stata has 26 extended missing values in addition to the default “.”, denoted by “.a” through to “.z”. These are ignored in any statistics or analysis, but still allow information about the type of missing data to be kept.
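pandas has no direct equivalent of Stata's extended missing values. One possible workaround, sketched below, is to keep the reason for missingness in a separate column before recoding the original variable to NaN. This is a stand-in technique rather than a built-in feature, and the column names, codes and labels are hypothetical.

```python
import pandas as pd

# Assumed user defined missing codes and labels (illustrative only).
MISSING_LABELS = {-9: "No response", -8: "Refusal", 99: "Not applicable"}

df = pd.DataFrame({"hours_worked": [35, -9, 20, -8, 99]})

# Look up the label for each value; valid values are not in the dict and become NaN.
df["hours_missing_reason"] = df["hours_worked"].map(MISSING_LABELS)

# Recode the missing codes to NaN so they are ignored in statistics.
df["hours_worked"] = df["hours_worked"].mask(df["hours_missing_reason"].notna())

print(df)
print(df["hours_missing_reason"].value_counts())
```

The analysis variable now contains only valid values, while the second column still records why each value is missing.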
Be aware of how your statistical program treats missing values to make sure you are not unwittingly including or excluding these values. For example, Stata treats missing values as larger than any non-missing number, so a conditional statement such as “if age > 50” will also select cases where age is missing.
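Other programs behave differently, which is why it is worth testing. The sketch below, using made-up ages, shows that in Python with pandas a comparison involving NaN evaluates to False, so missing values drop out of “age > 50” but are picked up by the logically similar “not (age <= 50)”.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
age = pd.Series([34, 67, np.nan, 52, np.nan, 45], name="age")

# Any comparison with NaN is False, so missing ages are excluded here.
over_50 = age > 50

# Negating the opposite condition turns those False results into True,
# so missing ages are unintentionally included.
not_50_or_under = ~(age <= 50)

# Stating the missing-value rule explicitly avoids relying on defaults.
over_50_explicit = age.notna() & (age > 50)

print("age > 50:            ", int(over_50.sum()))
print("not (age <= 50):     ", int(not_50_or_under.sum()))
print("explicit non-missing:", int(over_50_explicit.sum()))
```

Writing the missing-value condition explicitly, as in the last line, keeps the selection the same regardless of how the comparison is phrased.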
Once you understand how missing data have been treated in your dataset and have made any required adjustments, you might want to consider how to handle the missing data in your analyses.
For training resources and courses specifically about handling missing data, go to the Training opportunities section and filter by Missing Data.
Longitudinal population studies often use weights to account for complex sampling designs and to enable the data to better represent the population they are designed to cover. Survey weights are often provided by the study team alongside the deposited survey data; studies may also provide documentation and guidance specifically about their survey weights.
It is important to determine whether you need to use weights in your statistics and analysis and where these can be found.
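To show why this matters, the sketch below compares an unweighted and a weighted mean using NumPy. The incomes and weights are invented for the example, and the calculation only adjusts the point estimate; it does not produce design-based standard errors, for which you should follow the study's own guidance.

```python
import numpy as np

# Hypothetical incomes (in thousands) and survey weights for four respondents.
income = np.array([20.0, 30.0, 50.0, 100.0])
weight = np.array([2.0, 1.0, 1.0, 0.5])

# Unweighted mean: every respondent counts equally.
unweighted_mean = income.mean()

# Weighted mean: each respondent counts in proportion to their weight.
weighted_mean = np.average(income, weights=weight)

print(f"Unweighted mean: {unweighted_mean:.2f}")
print(f"Weighted mean:   {weighted_mean:.2f}")
```

Here the respondent with the highest income has the smallest weight, so the weighted mean is lower than the unweighted one.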
UK Data Service: Weights in social surveys
Some survey questions are only presented to a sub-sample of the study members because they are not relevant for the whole sample. Information about the routing of questions can be found in the documentation for a specific questionnaire or interview.
It is important to check whether the questions and variables you are interested in have been routed, which means they are only answered by a sub-sample. If they are, you may need to use other variables alongside your variable of interest to fully capture the whole sample.
For example, if someone answers “yes” to “Do you smoke cigarettes?”, they may be asked some follow-up questions, such as the type of cigarettes they smoke or how many cigarettes they smoke in a day. However, anyone who responded “no” to the original question won’t be asked the follow-up questions, because these would be irrelevant to them.
In this scenario, to capture the whole sample in one variable about quantity of cigarettes smoked, you would need to combine the original question and follow-up question:
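A minimal Python sketch of that combination, assuming hypothetical variables “smokes” (the original question) and “cigs_per_day” (the routed follow-up), might look like this. Non-smokers are given a quantity of 0, while anyone who did not answer the original question stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the follow-up is only answered by those who said "yes".
df = pd.DataFrame({
    "smokes": ["yes", "no", "yes", "no", np.nan],
    "cigs_per_day": [10, np.nan, 20, np.nan, np.nan],
})

# Non-smokers were routed past the follow-up, so their quantity is 0 rather than missing.
# Respondents with no answer to the original question remain missing.
df["cigs_per_day_all"] = df["cigs_per_day"].mask(df["smokes"] == "no", 0)

print(df)
```

The derived variable covers the whole sample, while the original routed variable is left unchanged for reference.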