AI in the context of generalist data repositories: the Harvard Dataverse case.
Advances in generative AI research and the development of large language models (LLMs) are rapidly changing how we interact with digital devices, applications, and computers. This overview first revisits basic concepts of generative AI in the context of natural language processing and LLMs. We then discuss potential applications for generalist data repositories, highlighting both the risks and the opportunities of adopting these technologies. Specifically, we explore how these ideas are implemented in the Harvard Dataverse repository, including semantic search, automatic data curation support, interactive data exploration through natural language queries, data augmentation, and knowledge graph construction. We also examine the performance of commercial and open-source models on these tasks and explain our development approach, which uses open-source models. Although our applications focus on the Harvard Dataverse repository, the techniques and methods we present are portable to other data repositories. References to the open-source code will be made available.