Exploring Natural Language Search for Data Retrieval using Large Language Models (LLMs)
Natural language search is increasingly important in data retrieval because it lets users search for data in everyday language. In an experiment, we're using Large Language Models (LLMs) to enable natural language search within Statistics Netherlands' Data Service Centre (DSC). We combine several sources: dataset and variable descriptions from the DSC, tips and tricks scraped from the intranet, and concepts from the Statistics Netherlands RDF store.
Part of the DSC data comes from the System of Social statistical Datasets (SSD), a comprehensive repository of microdata covering various aspects of people's lives, such as health, education, work, relationships, crime, and social benefits. Our goal is to develop an LLM-based search function that lets researchers retrieve relevant variables from the SSD more intuitively and efficiently.
Our approach, agentic Retrieval Augmented Generation (RAG), involves embedding textual infor