Why do we need metadata, or can we just use ChatGPT? Methods in metadata discovery and documentation using Generative AI
Since its launch in November 2022, ChatGPT has become a focal point for data analytics and discovery, with researchers, students, policy-makers and laypeople alike turning to Generative AI to assist with the documentation and interpretation of data. In this new world, the question arises: why do we need metadata if AI can fill in the gaps in our own understanding?
While data and metadata practitioners understand that contextual, nuanced understanding of data depends on timely, high-quality documentation, a gap remains between best practice and real-world scenarios. To bridge this gap, data librarians and data governance teams must be able to communicate the value of good metadata to stakeholders while also integrating new technologies. Given the reliance on emerging cloud and software-as-a-service technologies, there must also be a focus on educating users on how to preserve data privacy and minimise data leakage when using Generative AI tools.
In this presentation, we look at how Generative AI tools such as ChatGPT, together with data analytics tools like Pandas, can be used to augment data documentation and reduce the burden on data custodians without reducing data quality, by exploring the following topics:
* Methods in developing and running local language models
* Retraining language models using metadata resources
* Matching, summarising and generating data to produce metadata
* Privacy-preserving methods in publishing auto-generated content
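To give a flavour of the data-to-metadata step above, the following is a minimal sketch of how a tool like Pandas can profile a table into draft metadata records that a custodian, or a local language model, could then refine. The column names, sample data and the `draft_metadata` helper are illustrative assumptions, not part of Aristotle Activate's API.

```python
import pandas as pd


def draft_metadata(df: pd.DataFrame) -> list[dict]:
    """Summarise each column into a draft metadata record for later review."""
    records = []
    for name in df.columns:
        col = df[name]
        records.append({
            "name": name,
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "distinct": int(col.nunique()),
            # A small value sample gives a language model enough context to
            # draft a description without exposing the full dataset.
            "sample": col.dropna().unique()[:3].tolist(),
        })
    return records


# Hypothetical example table for illustration.
df = pd.DataFrame({
    "person_id": [101, 102, 103],
    "smoker_status": ["Y", "N", None],
})

for record in draft_metadata(df):
    print(record)
```

Keeping only summary statistics and a tiny value sample, rather than the raw rows, is one simple way to reduce data leakage when the draft is later passed to a Generative AI tool.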
This is done using Aristotle Activate, an open-source database-scanning and metadata-linkage tool designed to interact with the Aristotle Metadata Registry to retrain and refine language models, as well as to upload and publish linked and generated metadata.