Work on the Uniform Search Platform – the interface to the hundreds of thousands of variables, questions and instruments used by the CLOSER studies – is well under way.
The first stage of this work, and by far the most substantial, involves neither the search nor the platform. Instead, we are focusing on the uniform: standardising the metadata used to describe each study’s data holdings.
What is metadata?
Metadata, briefly, is data about data. Sourcing data appropriate for your research can be complex, with many variables to choose from, often showing related information. Before you can select the data that will answer your research question, you need to know what is available. Metadata helps signpost the data so you can find it.
For example, you might be interested in analysing smoking. There may be variables on how often people smoke, how long they have smoked for, what types of tobacco products they smoke, etc. You need to know what is there before you can pick the appropriate measure. If all those variables are tagged ‘smoking’ in their metadata, a simple search will find them for you.
Going a step further, CLOSER seeks to combine the metadata of multiple studies, so that when you search for smoking, you find the smoking variables in the 1986 sweep of the British Cohort Study, alongside the 1974 sweep of the National Child Development Study, and so on.
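To make the idea concrete, here is a minimal sketch of a tag-based search over a combined catalogue. It is an illustration only, not the platform's actual design: the `Variable` record, the `search` function and the variable names (`smoke_freq`, `years_smoking`, `height_cm`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One variable in a combined catalogue (hypothetical structure)."""
    study: str
    sweep: int
    name: str
    tags: frozenset


def search(variables, tag):
    """Return every variable carrying the given tag, ignoring case and padding."""
    wanted = tag.strip().lower()
    return [v for v in variables if wanted in v.tags]


# Invented entries for illustration; real variable names will differ.
CATALOGUE = [
    Variable("British Cohort Study", 1986, "smoke_freq", frozenset({"smoking"})),
    Variable("National Child Development Study", 1974, "years_smoking", frozenset({"smoking"})),
    Variable("British Cohort Study", 1986, "height_cm", frozenset({"anthropometry"})),
]

hits = search(CATALOGUE, "Smoking")
for v in hits:
    print(f"{v.study} ({v.sweep}): {v.name}")
```

One search term returns the smoking variables from both studies because they share a tag. That only works if the studies tag their variables consistently.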
Once upon a time …
…there was metadata. In fact, all of the studies involved in CLOSER already have some metadata describing their data. The problem for CLOSER is twofold. Firstly, there is inevitably some inconsistency in the metadata used by different studies at different times. For example, in one study, questions about coronary heart disease might be tagged ‘CHD’. In another study, they might be tagged ‘cardiovascular systems’. Secondly, the coverage of the metadata is patchy, both in terms of what is described and in how it is formatted and stored.
The solution to this is to adopt a common standard across the whole of the CLOSER metadata. We have chosen DDI Lifecycle (DDI-L). DDI, or the Data Documentation Initiative, is an international standard for documenting social science data. It provides a common structure and format for all the types of metadata you might want to collect. By applying DDI-L to all of the studies in CLOSER, we can ensure consistency and comparability, not only across the CLOSER studies but also potentially with any other metadata created in the DDI-L format.
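The tagging inconsistency described above can be illustrated with a small sketch that maps each study's local tags onto one shared term. This is not DDI-L itself, which is a far richer structure. The `TAG_MAP` table, the shared term ‘cardiovascular health’ and the `normalise` function are assumptions made for the example.

```python
# Hypothetical lookup from study-specific tags to one shared term.
TAG_MAP = {
    "chd": "cardiovascular health",
    "coronary heart disease": "cardiovascular health",
    "cardiovascular systems": "cardiovascular health",
}


def normalise(tag, mapping=TAG_MAP):
    """Lower-case the tag, collapse whitespace, then look up the shared term.

    Tags with no entry in the mapping are returned in their cleaned form.
    """
    key = " ".join(tag.lower().split())
    return mapping.get(key, key)


study_a_tag = "CHD"
study_b_tag = "Cardiovascular  systems"

# After normalisation both studies' tags resolve to the same search term.
same_concept = normalise(study_a_tag) == normalise(study_b_tag)
print(same_concept)
```

A lookup table like this only papers over differences in wording. A common standard goes further by fixing the structure and format of the metadata as well.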
What needs doing?
The good news is that much of the existing variable metadata across the studies can be easily mapped to the DDI standard. The bad news is that almost none of the studies hold questionnaire metadata, which is a key part of our vision for the search platform. Questionnaire metadata – or metadata for any other data collection method – is important to a greater understanding of the data being used. If you have a variable ‘years smoking’ in a study, for example, there is still a lot of additional information you need to interpret that variable:
- How old were the participants when they were being asked?
- Were the participants only being asked about tobacco products, or asked about anything they might smoke?
- How might where the question was situated in a questionnaire affect responses?
- Were participants being asked about their last consecutive period of smoking, or any period of smoking?
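As a rough sketch, the answers to questions like these could travel with the variable as a linked record. The structure below is hypothetical and is not taken from DDI-L or from any CLOSER study: the class names, the field names and the example values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuestionContext:
    """Questionnaire details needed to interpret a variable (hypothetical fields)."""
    question_text: str
    position_in_questionnaire: int  # where the question sat in the questionnaire
    age_at_interview: int           # how old participants were when asked
    products_covered: str           # tobacco only, or anything smoked
    period_covered: str             # last consecutive period, or any period


@dataclass
class VariableRecord:
    """A variable plus, where available, the question that produced it."""
    name: str
    label: str
    question: Optional[QuestionContext] = None

    def has_question_metadata(self) -> bool:
        return self.question is not None


# A variable with variable-level metadata only.
bare = VariableRecord("years_smoking", "Years smoking")

# The same variable once questionnaire metadata is attached (invented values).
documented = VariableRecord(
    "years_smoking",
    "Years smoking",
    QuestionContext(
        question_text="For how many years have you smoked?",
        position_in_questionnaire=42,
        age_at_interview=16,
        products_covered="tobacco products only",
        period_covered="any period of smoking",
    ),
)

print(bare.has_question_metadata(), documented.has_question_metadata())
```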
The first major aim of this project is to enhance the metadata from all of the studies involved by creating metadata for every single questionnaire. These questionnaires start as early as the 1940s and vary in content and methodology. Overall, within the project’s scope there are:
- more than 200,000 survey questions
- more than 300 data collection tools in around 100 ‘sweeps’
- around 90 validated survey instruments and 20 validated clinical measures, as well as a range of cognitive or physical tests.
Introducing: Caddies
To facilitate this work, the CLOSER team members from the Institute of Education have developed the CLS Abridged DDI Editor for Surveys, or Caddies. For those with a technical bent, behind Caddies is a SQL server database with a Ruby on Rails user interface for metadata entry. The tool allows questionnaire metadata to be entered and checked, described according to the DDI Lifecycle standard. It was created to work with DDI-L version 3.1 and will shortly be updated to produce metadata compliant with version 3.2, ensuring the metadata entered can be exported into other DDI-compliant software for further work. Alongside this, work has been completed on producing a metadata profile – the technical specification of the CLOSER implementation of DDI-L. This is where we specify which fields we will be using and the details of what will be entered into them.
Caddies tries to make the job of inputting metadata simple. Whilst some of the more recent surveys might exist in a form suitable to be transformed, many of them will not. Staff will soon be in place to begin this work in earnest.
Looking to the future
This is the first step towards the goals of the CLOSER Uniform Search Platform. Following this, work will be needed to harmonise and categorise this new metadata. Links will need to be made between questions both within a study (across sweeps) and across studies. This metadata enhancement work will run in parallel with development of the search platform itself.