
IASSIST at 50! Bridging oceans, harbouring data & anchoring the future

CLOSER's Data Discoverability team will attend and contribute to this year's IASSIST conference, which runs from 3 to 6 June 2025 in Bristol, UK.

About the conference

IASSIST 2025 invites you to Bristol, United Kingdom, for its golden anniversary conference from 3 to 6 June 2025, to engage with the past, present and future of data services, including data management and technologies.

IASSIST (International Association for Social Science Information Services and Technology) is an international organisation of professionals working with information technology and data services to support research and teaching.

The conference will be held in person, with a focus on networking opportunities and interaction.

Further details about the conference, including the current programme and registration, can be found on the IASSIST website.

CLOSER’s presence at the conference

Our Data Discoverability team will deliver a presentation and a poster, and take part in a panel discussion and a workshop. Further details can be found below:

Presentation – Building a community around DDI: The European DDI Users Conference

Jon Johnson (CLOSER), Joachim Wackerow (Independent Consultant), Mari Kleemola (Finnish Social Science Data Archive)

The annual European DDI Users Conference (EDDI) was established in 2009 to bring together users of the DDI standards, exchange ideas and support capacity building, loosely based on the IASSIST model. The DDI (Data Documentation Initiative) standards are open standards for describing and managing data from the social, demographic, economic and health sciences. This presentation will outline the development of the conference and how it has evolved to reflect the changing needs of data producers, managers and disseminators.

The presentation will also discuss the challenges of bringing together a diverse community of metadata and data producers and users, where EDDI has succeeded, and where there is room for improvement.

Poster – Unlocking the potential of standardised scales through metadata

Claudia Alioto, Becky Oldroyd, & Jon Johnson (CLOSER)

Standardised scales, also known as summated scales or validated questionnaires, are a group of related questions that measure an underlying concept. These scales are valuable research tools as they are cost- and time-effective to implement, and allow researchers to reliably measure concepts across samples and over time. The use of standardised scales also enhances the comparability of research data.

However, finding information (i.e. metadata) about standardised scales is challenging and time-consuming. Information is typically scattered across multiple sources and documents, if available at all, and permission is sometimes required to access the scale. Additionally, some scales have multiple versions that include a subset of the original items, and it can be difficult to trace these versions back to the original. It is also difficult to find information about where these scales have been used in existing research.

CLOSER aims to address these challenges by making information on multiple standardised scales openly accessible to researchers. We have gathered and documented up-to-date, comprehensive metadata about the scales used in the CLOSER Discovery studies in one centralised, publicly available platform: CLOSER Discovery. For 10 standardised scales, users can now find the name, citation, question items (both the original and other versions), topics measured (e.g. alcohol consumption, physical health, depression), and where each scale is used in the CLOSER Discovery study questionnaires and datasets. We are preparing to add metadata for an additional 10 scales to CLOSER Discovery in early 2025. To our knowledge, CLOSER Discovery is the only platform providing such detailed metadata on standardised scales, enabling researchers to identify where scales are used both within and across studies.
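As an illustration only, the sketch below shows one way such a scale record could be represented in code; the field names and example values are hypothetical and do not reflect the actual CLOSER Discovery or DDI schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScaleVersion:
    """One version of a standardised scale (e.g. the original or a short form)."""
    name: str
    items: List[str]  # question wording for each item

@dataclass
class StandardisedScale:
    """Hypothetical record bundling the metadata fields described above."""
    name: str
    citation: str
    topics: List[str]                                  # e.g. alcohol consumption, depression
    original: ScaleVersion
    other_versions: List[ScaleVersion] = field(default_factory=list)
    used_in: List[str] = field(default_factory=list)   # questionnaires/datasets using the scale

# Invented example values, purely illustrative.
scale = StandardisedScale(
    name="Example Wellbeing Scale",
    citation="Author et al. (year)",
    topics=["mental health", "wellbeing"],
    original=ScaleVersion("Full 10-item form", [f"Item {i}" for i in range(1, 11)]),
    other_versions=[ScaleVersion("Short 4-item form", [f"Item {i}" for i in (1, 3, 5, 7)])],
    used_in=["Study A, wave 2 questionnaire", "Study B, age-14 sweep"],
)
print(scale.name, len(scale.original.items), "items")
```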

This poster will describe the process of creating comprehensive standardised scale metadata, the benefits of this metadata for researchers, and our future plans.

Panel discussion – Back to the rough ground! Retrieving concepts in survey research and its potential uses

Deirdre Lungley (UKDS), Suparna De (University of Surrey), Chandresh Pravin (University of Surrey), Jon Johnson (CLOSER), Paul Bradshaw (ScotCen)

Survey design and the fielding of questionnaires involve significant effort in asking the right questions to elicit high-quality data from respondents. Yet for researchers coming to data from archives, much of this information is lost or locked up in PDFs, which are burdensome to use and a barrier to the ambitions of FAIR.

The technical capability to serve up such metadata is well supported by standards such as the DDI suite. Populating such schemas at scale will, however, require a step change in the way metadata is utilised in the data lifecycle. The absence of high-quality question banks and the paucity of ‘this is how you do it’ projects are demotivating factors for adoption.

The ESRC Future Data Services pilot project between CLOSER, the University of Surrey, the UK Data Service and ScotCen is tackling these issues, utilising the CLOSER metadata repository as a training (meta)data set to develop novel machine learning approaches to the extraction of metadata from survey questionnaires, the conceptual extraction and alignment of questions, and the use of concepts to drive machine-actionable disclosure assessment.

The panel will report on progress in these three areas.
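As a rough illustration of the second of these areas, the conceptual alignment of questions, the sketch below matches question text to candidate concepts by embedding similarity; the library, model name and example questions are assumptions for illustration, not the pilot project's actual pipeline.

```python
# A minimal sketch of aligning survey questions to concepts via embedding
# similarity. The library, model and example text are illustrative
# assumptions, not the pilot project's actual approach.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "How many units of alcohol do you drink in a typical week?",
    "Over the last two weeks, how often have you felt down or hopeless?",
]
concepts = ["alcohol consumption", "depression", "physical activity"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do
q_emb = model.encode(questions)
c_emb = model.encode(concepts)

# For each question, pick the concept with the highest cosine similarity.
sims = cosine_similarity(q_emb, c_emb)
for question, row in zip(questions, sims):
    print(f"{question!r} -> {concepts[row.argmax()]} (score {row.max():.2f})")
```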

Workshop – AI-enabled data practices for metadata discovery and access: Best practices for developing training data

Wing Yan Li (University of Surrey), Chandresh Pravin (University of Surrey)

Continued investment in new and existing data collection infrastructures (such as surveys and smart data) highlights the growing need to create efficient, robust and scalable data resources that help researchers find and access data. Recent advances in artificial intelligence (AI) methods for the automatic analysis of large text collections provide a unique opportunity, at the intersection of computational techniques and research methodologies, to develop data resources that can meet the current and future needs of the research community.

With the widening application of AI and machine learning (ML) pipelines for processing large text corpora, this workshop focuses on a fundamental prerequisite before setting up any pipeline for downstream tasks: the dataset. It is a common perception that ML models are data-hungry and require vast amounts of data to enhance model performance. While understandable, this perception can sometimes overshadow the importance of data quality. In collaboration with CLOSER, this workshop will cover a typical "packaging" of data to train and evaluate models. The workshop will explore various aspects that contribute towards good practice for creating quality training datasets, including exploratory data analysis, selection of evaluation metrics, model selection and model evaluation.
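As an illustration of what such a "packaging" can involve, the sketch below splits a labelled text dataset into stratified train, validation and test sets and notes a possible evaluation metric; the data, labels and split sizes are invented assumptions, not the workshop's materials.

```python
# A minimal sketch of "packaging" a labelled text dataset: stratified
# train/validation/test splits plus an up-front metric choice. All data,
# labels and split sizes here are invented for illustration.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Invented toy data: survey question texts and topic labels.
texts = [f"question {i}" for i in range(20)]
labels = ["health", "income"] * 10

# Hold out a test set first, then carve a validation set from the remainder.
# Stratifying keeps the label balance similar across splits.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 12 4 4

# Macro F1 treats rare and common labels equally, which matters when
# topics are imbalanced:
# f1_score(y_val, predictions, average="macro")
```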

Conventionally, models are evaluated both quantitatively, through appropriate metrics, and qualitatively. While it would be tedious to qualitatively analyse every sample, purely random sampling can also be problematic. In the section covering model evaluation, workshop participants will be introduced to the problem of data biases and gaps. By bridging technological approaches with social science research needs, this workshop offers an exploration of data transformation techniques that enhance research reproducibility and computational analysis capabilities.
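To make the point about random sampling concrete, the sketch below contrasts a purely random qualitative review sample with one drawn per group, so that under-represented groups (a potential source of biases and gaps) are not missed; the data and group names are invented for illustration.

```python
# A minimal sketch contrasting purely random sampling with per-group
# sampling for qualitative review. Data and group names are invented.
import pandas as pd

df = pd.DataFrame({
    "text": [f"question {i}" for i in range(12)],
    "topic": ["health"] * 8 + ["income"] * 3 + ["housing"] * 1,
})

# A purely random sample of 4 rows can easily miss "housing" entirely.
random_sample = df.sample(n=4, random_state=0)

# Sampling within each topic guarantees every group appears in the
# qualitative review, making potential gaps and biases visible.
per_group = df.groupby("topic", group_keys=False).apply(
    lambda g: g.sample(n=min(2, len(g)), random_state=0))

print(random_sample["topic"].value_counts())
print(per_group["topic"].value_counts())
```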