Data discovery made easy: helping less-expert users find data via a large language model
The "Data Discovery Made Easy" project (DDME) is funded by the UK Economic and Social Research Council as part of their "Future Data Services (Pilots)" programme. It is adding a prototype natural language search interface to the existing web site A Vision of Britain through Time. This is a public interface to the Great Britain Historical GIS (GBHGIS), a large Postgres/PostGIS database holding data from every British census 1801-2021, diverse other statistics including vital registrations and the farming census, and digital boundaries for most of the ever-changing reporting geographies.
We argue that existing data services have become too focused on the needs of data scientists, who invest substantial time in learning to navigate download systems. We focus instead on mainstream social scientists and others, such as journalists and policy analysts, who are often seeking just one local time series or even a single data value. The GBHGIS holds all statistics in a single central data store, but the diversity of content and the enormous complexity of Britain's statistical geographies make data discovery challenging.
DDME aims to provide a more user-friendly way to access statistics, with the tool acting as a bridge between the non-specialist user and the data repository. This presentation will summarise the results of DDME, including how we use Large Language Models (LLMs) with software packages such as LangChain and LangGraph to build a graph-based process that guides the language model to the desired data, reducing the likelihood of AI hallucination. These packages also allow DDME to be LLM-agnostic, meaning we can easily move away from OpenAI’s service once locally hosted models become more viable. Once the user and the tool have selected a subset of the data, the DDME interface chooses how to present it, generating appropriate interactive plots on demand.
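To make the idea of a graph-based process concrete, the following is a minimal sketch and not the DDME implementation. It deliberately uses only the Python standard library in place of LangChain and LangGraph, and every name in it (CATALOGUE, keyword_chooser, the node functions, the view labels) is hypothetical, as are the placeholder figures. It illustrates three points made above: each node offers the model only options that actually exist in the data store and rejects any other answer, the model is passed in as a plain function so it can be swapped, and the final node picks a presentation from the shape of the selected data.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical miniature catalogue: topic -> place -> {year: value}.
# Names and numbers are invented placeholders, not GBHGIS content.
CATALOGUE: Dict[str, Dict[str, Dict[int, float]]] = {
    "population": {"Portsmouth": {1801: 1.0, 1901: 2.0, 2001: 3.0}},
    "births": {"Portsmouth": {1901: 4.0}},
}

State = Dict[str, Any]
# Any function with this shape can play the role of the language model,
# which is what keeps the sketch model-agnostic.
Chooser = Callable[[str, List[str]], str]  # (question, options) -> choice


def constrained_pick(chooser: Chooser, question: str,
                     options: List[str]) -> Optional[str]:
    """Ask the model to choose, but accept only values that really exist."""
    answer = chooser(question, options)
    return answer if answer in options else None


def pick_topic(state: State, chooser: Chooser) -> str:
    state["topic"] = constrained_pick(chooser, state["question"], sorted(CATALOGUE))
    return "pick_place" if state["topic"] else "fail"


def pick_place(state: State, chooser: Chooser) -> str:
    places = sorted(CATALOGUE[state["topic"]])
    state["place"] = constrained_pick(chooser, state["question"], places)
    return "fetch" if state["place"] else "fail"


def fetch(state: State, chooser: Chooser) -> str:
    # Values come from the data store, never from the model's own text.
    state["series"] = CATALOGUE[state["topic"]][state["place"]]
    return "choose_view"


def choose_view(state: State, chooser: Chooser) -> str:
    # One value is shown as a number; several values are shown as a chart.
    state["view"] = "line_chart" if len(state["series"]) > 1 else "single_value"
    return "end"


def fail(state: State, chooser: Chooser) -> str:
    state["error"] = "no matching data"
    return "end"


NODES = {
    "pick_topic": pick_topic,
    "pick_place": pick_place,
    "fetch": fetch,
    "choose_view": choose_view,
    "fail": fail,
}


def run(question: str, chooser: Chooser) -> State:
    """Walk the graph from the first node until a node returns 'end'."""
    state: State = {"question": question}
    node = "pick_topic"
    while node != "end":
        node = NODES[node](state, chooser)
    return state


def keyword_chooser(question: str, options: List[str]) -> str:
    """Stand-in for an LLM call: return the first option named in the question."""
    lowered = question.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return "unknown"  # simulates a made-up answer, which the graph rejects


if __name__ == "__main__":
    print(run("What was the population of Portsmouth?", keyword_chooser))
```

In this sketch, replacing keyword_chooser with a function that calls a hosted or locally run model would leave the graph itself unchanged, and a question about data that is not in the catalogue ends at the fail node instead of producing an invented figure.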