AI-enabled data practices for metadata discovery and access: Best practices for developing training data

Continued investment in new and existing data collection infrastructures (such as surveys and smart data) highlights the growing need to create efficient, robust and scalable data resources that help researchers find and access data. Recent advances in artificial intelligence (AI) methods for automatically analysing large text collections provide a unique opportunity, at the intersection of computational techniques and research methodologies, to develop data resources that meet the current and future needs of the research community.

With the widening application of AI and machine learning (ML) pipelines for processing large text corpora, this workshop focuses on a fundamental prerequisite for setting up any pipeline for downstream tasks: the dataset. ML models are commonly perceived as data-hungry, requiring vast amounts of data to perform well. While understandable, this perception can overshadow the importance of data quality. In collaboration with CLOSER, this workshop will cover a typical “packaging” of data to train and evaluate models. The workshop will explore the aspects that contribute to good practice in creating quality training datasets, including exploratory data analysis, selection of evaluation metrics, model selection and model evaluation.

Conventionally, models are evaluated both quantitatively, using appropriate metrics, and qualitatively. While qualitatively analysing every sample can be tedious, relying on random sampling can be problematic. In the section covering model evaluation, workshop participants will be introduced to the problem of data biases and gaps. By bridging technological approaches with social science research needs, this workshop explores data transformation techniques that enhance research reproducibility and computational analysis capabilities.

Wing Yan Li
University of Surrey
United Kingdom

Chandresh Pravin
University of Surrey
United Kingdom