The archive provides titles, presenters, abstracts, and links to presentations in Zenodo and recordings on YouTube, where available. If the list does not include any links, it means the presentation was not obtained by IASSIST. However, it may be available online, if the authors have published it elsewhere.
All recent archives are organized in the following order: plenaries, concurrent sessions, lightning talks, posters, and workshops. Abstracts can be viewed by clicking the button below.
June 4 and June 5, 2025: Plenaries
Plenary 1 (Wednesday): no title available
Dr. Shawn Sobers (University of the West of England, UK)
no abstract available
Plenary 2 (Thursday): Why are we here? What should we do? Why does it matter? OR Building an ecosystem one step at a time
Data Management from Scratch: Establishing a New Data Management Program at Harvard University
Sarah Marchese (Harvard University)
With the exponential growth of research data driven by technological advancement comes an increased need for greater expertise in how to better manage, organize, store, and share research data. But what is the best strategy for setting up a new Research Data Management program at your university? This presentation will discuss how the Research Computing group at Harvard University's Faculty of Arts and Sciences established and developed a new Research Data Management program in the spring of 2024. We will examine how a community survey sent to faculty, researchers, and staff helped the group identify its greatest gaps and needs and define the immediate and long-term priorities for the program. We will then outline some of the challenges we've since experienced, including a backlog of inactive data storage, new federal regulations and guidelines around data retention, and the introduction and adoption of data management concepts and policies both internally with colleagues and externally with faculty and staff. Finally, we will highlight future plans for the program, including increased documentation and data visualization tools, potential team growth, and greater cohesion and collaboration with other groups across the campus (IT, Library, Departments).
Iscte Research Data Management Implementation Strategy
Clara Boavida (Iscte - Instituto Universitário de Lisboa)
Iscte – Instituto Universitário de Lisboa is a public higher education institution in Portugal committed to providing the infrastructure and policy framework necessary for its scientific community to uphold the principles of open research. Iscte has had an Open Access Policy since 2009 (updated in 2015), and a Research Data Management and Sharing Policy approved in 2023. The Iscte Repository has been running since 2007, and in 2013 it became interoperable with the current research information system – Ciência-IUL – and with the data repository Zenodo. This interoperability makes it possible to link research results, including scientific publications, research data and data management plans, with the funding grant, and is interoperable with OpenAIRE. The workflow in operation is the result of the synergy between services, whose implementation criteria guarantee compliance with the FAIR principles. For the storage of raw data, Iscte has a dedicated data centre designed to be flexible and scalable. Recently, Iscte joined the Re.Data Consortium, approved in November 2024. In articulation with the Consortium, ten Research Data Management (RDM) Centres have been established in Portuguese Research Performing Organisations to ensure the capacity building of the scientific community, the improvement of the supported infrastructure, and an increase in the number of datasets published in trusted repositories. Within Re.Data, Iscte will lead the operationalisation of the Portuguese Network of Data Stewards and the establishment of a robust helpdesk service, enhanced by AI automation, to ensure efficient and responsive support to all users of the community, leveraging the expertise of the Consortium and the Competence Centres.
This presentation will not only describe the Iscte RDM strategy, the challenges and barriers encountered in its implementation, but also the work carried out during the first six months of the consortium, focusing on the involvement of the community in contributing to the activities described above.
Supporting the Canadian DMP landscape: An overview of DMPEG activities and a new DMP Assessment Rubric!
James Doiron (University of Alberta)
The Digital Research Alliance of Canada's Data Management Planning Expert Group (DMPEG) develops and delivers publicly available DMP-related guidelines, best practices, content, and resources for supporting researchers and research excellence across Canada, including DMP templates, examples, and guidance materials. DMPEG also supports the ongoing development, maintenance, and sustainability of DMP Assistant, a freely available bilingual web-based tool providing templates, questions, and guidance for supporting researchers with their DMP needs. This session offers an overview of DMPEG activities and resources, notably highlighting a new DMP assessment rubric developed specifically to support researchers in developing quality DMPs and meeting requirements at the funding application stage, including those implemented by the Tri-Agency, Canada's national funders of research. An overview of the assessment rubric and its content, along with an accompanying Simplified DMP template that it is standardized to, will be provided. Information regarding DMP Assistant, including its key features and how to access and use it will also be provided, along with additional resources developed by DMPEG, including DMP examples. Future work and directions will additionally be discussed.
Transforming the Library into a Trusted Research Environment: unlocking access to sensitive data at the LSE Library
Hannah Boroudjou (London School of Economics (LSE))
Trusted Research Environments (TREs) are physical or digital environments designed to allow access to confidential data. Access to this data is core to social sciences research, which is why securing access has emerged as such a challenge at LSE, a specialist institution with a focus on the social sciences. The solution has been the development of the Datalibrary service at LSE, which has worked to meet rapidly rising demand for access to secure data, and worked with partners to develop various TRE solutions, both physical and digital. The team is also an integral part of the licensing and application process, liaising on behalf of researchers with our Legal, Cyber Security, and Data and Technology Services. What makes the Datalibrary at LSE unique is our focus on widening access to confidential data outside of the Academy, in line with our purpose as a National Research Library. This has principally been done via the SafePod service, delivered in collaboration with the ESRC-funded SafePod Network (SPN). However, the SafePod has been overbooked since the closure of the ONS Safe Rooms in 2021, which left the LSE as one of the only access points to secure data in London. In response, we are working with the SPN to expand our offer via the SafePoint Hub, due to open in 2025. In this session we'll look at how the library has developed, the practicalities of space and financial resourcing, how we've worked with our partners to develop the service and how we plan to expand in future. We'll look at the challenges we've faced in delivery, both in terms of financial costs and staff time, and the lessons we've learned along the way, and will share advice for anybody looking to develop a similar service in future.
Data discovery made easy: helping less-expert users find data via a large language model
Humphrey Southall (University of Portsmouth)
Xan Morice-Atkinson (University of Portsmouth)
Paula Aucott (University of Portsmouth)
The "Data Discovery Made Easy" project (DDME) is funded by the UK Economic and Social Research Council as part of their "Future Data Services (Pilots)" programme. It is adding a prototype natural language search interface to the existing web site A Vision of Britain through Time. This is a public interface to the Great Britain Historical GIS (GBHGIS), a large Postgres/PostGIS database holding data from every British census 1801-2021, diverse other statistics including vital registrations and the farming census, and digital boundaries for most of the ever-changing reporting geographies. We argue that existing data services have become too focused on the needs of data scientists, investing substantial time in learning to navigate download systems. We focus instead on mainstream social scientists and others like journalists and policy analysts, often seeking just one local time series or even a single data value. The GBHGIS holds all statistics in a single central data store, but the diversity of content and the enormous complexity of Britain's statistical geographies makes data discovery challenging. DDME aims to provide a more user-friendly way to access statistics, with the tool acting as a bridge between the non-specialist user and the data repository. This presentation will sum up the results of DDME, including how we use Large Language Models (LLMs) with software packages like LangChain and LangGraph to produce a graph-based process to guide the language model to the desired data, mitigating the likelihood of AI hallucination. This software also allows DDME to be LLM-agnostic, meaning we can easily change from OpenAI's service once locally hosted models become more viable. After a subset of the data has been selected by the user and tool, the DDME interface is able to choose how to present the data to the user, generating appropriate interactive plots on demand.
Optimising the UK Longitudinal Linkage Collaboration researcher journey through the development of inter-operable data discoverability, data documentation and data access systems
Richard Thomas (UK LLC, University of Bristol)
Katharine Evans (UK LLC, University of Bristol)
Rachel Calkin (UK LLC, University of Bristol)
Abigail Hill (UK LLC, University of Bristol)
Stela McLachlan (UK LLC, University of Edinburgh)
Emma Turner (UK LLC, University of Bristol)
Robin Flaig (UK LLC, University of Edinburgh)
Andy Boyd (UK LLC, University of Bristol)
UK Longitudinal Linkage Collaboration (LLC) is the national Trusted Research Environment (TRE) for the UK's longitudinal research community. LLC integrates data from many UK Longitudinal Population Studies (LPS) and systematically links participants' health, environmental and non-health socio-economic records into a centralised TRE. Co-locating many LPS' datasets and including linked routine records enables a highly diverse UK-wide sample, increases overall statistical power to investigate 'rare' exposures/outcomes and includes seldom heard population sub-groups. However, the breadth of data raises a substantial data discovery, selection and inference challenge. To enable LLC to effectively support its users, we have developed a multi-layered FAIR (findable, accessible, interoperable, reusable) system to optimise the researcher journey and to support users to identify and understand the data they need for their research. LLC collates internal LLC metadata and draws metadata from data owners and related metadata infrastructures via Application Programming Interfaces, and then surfaces the metadata into a component of the system. First, LLC Explore (https://explore.ukllc.ac.uk/), a web-based data discoverability tool, provides search functionality – now being integrated with a large language model to enhance performance – with advanced filtering to enable researchers to identify the data items most suited to their research question and to build a data request. Second, LLC Guidebook (https://guidebook.ukllc.ac.uk/) contains the documentation and metrics needed to understand the provenance of the data and how these data have been impacted by the linkage and de-identification processes. Third, integration of LLC Explore's data request with the bespoke LLC application management system facilitates rapid application review with automated project-level data provision.
Finally, our GitHub repositories and associated processes support users to deposit and document reusable research resources (e.g. syntax and code lists) into a community archive. This multi-layered system is designed to provide a high-quality user experience, whilst maximising the value of existing LPS infrastructure initiatives.
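The metadata-collation pattern described above (drawing records from several owners' APIs and surfacing them through one discovery layer) can be sketched minimally as follows. The source names, record fields, and target schema are all hypothetical, not UK LLC's actual interfaces.

```python
from typing import Callable, Dict, List

def harvest(sources: Dict[str, Callable[[], List[Dict]]]) -> List[Dict]:
    """Pull metadata records from each source API (stubbed here as a
    callable per source) and normalise them into one common schema
    for the discovery layer, tagging every record with its origin
    so provenance survives aggregation."""
    merged: List[Dict] = []
    for name, fetch in sources.items():
        for rec in fetch():
            merged.append({
                "source": name,                          # provenance tag
                "variable": rec.get("name", "unknown"),  # harmonised field
                "description": rec.get("label", ""),     # human-readable text
            })
    return merged
```

In a real deployment each callable would wrap an HTTP client for one data owner's API; normalising at harvest time is what lets a single search index (or downstream LLM) query everything uniformly.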
Session theme/info: Consultation, Literacy, and Librarianship
Exploring student motivations to engage with voluntary data literacy library workshops
Kelly Schultz (University of Toronto)
Due to the nature of voluntary data literacy library instruction, such workshops often struggle to achieve high enrollment and attendance. Looking to understand this phenomenon better, a qualitative study was conducted to gain insights into learner motivations and challenges when engaging in voluntary data literacy instruction offered by the Map & Data Library at the University of Toronto. As part of this investigation, workshop titles and descriptions were explored to see what impact they had on enrollment motivation. This presentation discusses some of the themes uncovered through this research, provides suggestions on how to address these findings, and presents some of the ongoing work done since to improve library data literacy workshop engagement.
Bridging the Gap: Inclusive Approaches to Data Literacy Across Disciplines
Di Yoong (Carnegie Mellon University Libraries)
Emma Slayton (Carnegie Mellon University Libraries)
Charlotte Kiger (Carnegie Mellon University Libraries)
Buzzwords like "Open Science" and "Big Data" often capture attention but risk alienating those in the humanities and social sciences who may not see their work as data-driven. Simultaneously, STEM students' confidence in handling quantitative data can mask gaps in their understanding of data literacy as a broader concept. At Carnegie Mellon University Libraries, we address these challenges by bridging the divide between STEM-centric and humanistic data literacy, equipping researchers with skills to engage effectively with data science across disciplines. This paper explores data literacy through an interdisciplinary lens, emphasizing the importance of qualitative and textual data practices alongside quantitative approaches. By recognizing that all data—no matter how abstract—is socially constructed, we challenge the notion that data literacy is solely quantitative. Our strategies include developing pedagogical tools that address epistemological differences, supporting humanities researchers in technical data methods, and broadening STEM students' narrow definitions of data literacy to encompass the entire data lifecycle. Drawing on our expertise across disciplines, we have led initiatives to contextualize data literacy within humanities, social sciences, and STEM, fostering cross-disciplinary engagement. Our efforts focus on: Recognizing diverse disciplinary communicative practices, Developing flexible, adaptable pedagogical approaches, Decentering STEM-centric models of data engagement, and Emphasizing the role of qualitative research in data science. Through comparative analysis of institutional experiences, we highlight the importance of crafting inclusive, accessible educational approaches that respect diverse learning modes and prior experiences. Our work demonstrates how librarians can develop frameworks to empower researchers and students across academic communities, influencing university policy and fostering data fluency for all. 
This presentation provides a roadmap for integrating these principles into data literacy programs at other institutions.
Delivering Data Education: Bridging communities and anchoring library data services
Julie Goldman (Harvard Library)
The Countway Research Data Services (RDS) Librarian was a newly created position in 2017 to determine and prioritize the RDM services most applicable to the local Longwood Medical Area (LMA) community. As a member of the "Publishing and Data Services" team, the Countway RDS Librarian serves as a subject expert, providing direct support to biomedical and public health researchers in navigating the data landscape.
A key activity of the RDS Librarian is teaching regular Research Data Management Seminars on data management topics. In 2018, the Data Management Seminar Series was launched, offering seven in-person classes and establishing an official program for data education at Harvard. Due to the COVID-19 pandemic, the seminars became webinars, allowing members outside the LMA to learn about RDM. The introduction of online classes has allowed the RDS Librarian to teach more classes to more participants. For example, in 2020 the program offered 38 classes to over 1,500 students. While this level of programming was not sustainable, the RDS Librarian has found a balance over the last four years for consistent open data education for Harvard constituents and beyond. This lightning talk will explore the development and sustainability of the RDM Seminar Series, provide insights on partnerships and collaborations across the university, and show how the program responds to the needs of the research community in order to provide high-level data literacy education to a diverse, expert audience.
Session theme/info: Partnerships and Collaboration
The European Language Social Science Thesaurus (ELSST): Keeping it FAIR
Sharon Bolton (UK Data Service)
Jeannine Beeken (UK Data Service)
Lorna Balkan (UK Data Service)
Ami Katherine Saji (PROGEDO)
Róza Vajda (TK KDK)
The European Language Social Science Thesaurus (ELSST) is a comprehensive multilingual thesaurus designed for the social sciences. ELSST is available in 15 languages, and facilitates research discovery worldwide, supporting access to social science data regardless of language or domain. ELSST is owned and developed by the Consortium of European Social Science Data Archives (CESSDA). This presentation will focus on the collaborative nature of ELSST, demonstrating how CESSDA partners work together to ensure ELSST reflects the changing nature of international, multilingual, social science research. It will also cover new ELSST developments that increase its profile and value as an easy-to-use, FAIR social science metadata vocabulary tool, such as the versioning of concepts, searchable linkage between ELSST and other CESSDA services, and updates to subject coverage of changing areas such as working patterns, education, family structure, and digital technology.
FIDELIS: Uniting the European digital data repository community beyond finite project efforts
Mari Kleemola (Finnish Social Science Data Archive, Tampere University)
Tuomas J. Alaterä (Finnish Social Science Data Archive, Tampere University)
Maaike Verburg (Data Archiving and Networked Services)
Providing open access to research data increases the reliability and efficiency of the scientific process, as it allows research results to be reproduced, replicated, and re-used. The European Open Science Cloud (EOSC) develops a web of FAIR (Findable, Accessible, Interoperable, Reusable) data and services for science in Europe. Trustworthy Digital Repositories (TDRs) providing long-term preservation services are essential for achieving a sustainable EOSC. The three-year EU-funded FIDELIS project, which started in January 2025, aims to establish a healthy, vibrant, and self-sustaining network of TDRs that will foster a supportive open science environment and guarantee FAIR data sharing also in the future. FIDELIS will also advance harmonisation and interoperability across repositories, strengthen the repositories and enable expansion of the network through active training and support programmes. This talk will build on our vision of how a TDR network could unite the repository community beyond ambitious but finite project efforts and offer a coordination and networking mechanism for addressing common challenges like sustainability. We will present preliminary results from a repository landscape analysis survey, provide an overview of the characteristics, functions, and activities exhibited across a broad range of repositories, and discuss how we can reach a clear and shared consensus on what constitutes a TDR in the context of EOSC. The talk will also describe how FIDELIS will engage with the community of repositories that are diverse in terms of maturity, domain, geographical location, and targeted users. We will conclude with an introduction to the project's main routes for upskilling repositories and assisting them in achieving trustworthiness: cascading grants, training, and mentoring programmes.
EOSC-ENTRUST's Driver 2: A prototypic use case for federated, multinational use of TREs
Beate Lichtwardt (UKDS/UKDA, University of Essex)
Sharon Bolton (UKDS/UKDA, University of Essex)
Deborah Wiltshire (Gesis)
Peter Hegedus (TARKI)
The EOSC-ENTRUST project aims to create a 'European Network of TRUSTed research environments' for sensitive data and to drive interoperability by developing a common blueprint for federated data access and analysis. Data governance, legislation and management all differ between its international partner institutions. How to reconcile these differences to enable the wider sharing of sensitive data is at the heart of this project. EOSC-ENTRUST Driver 2 represents one of four use cases, which are prototypic for federated, multinational use of TREs in research practice across scientific domains and user communities. Our presentation will outline Driver 2's use case, where we are testing and building on a framework established under the SSHOC project for implementing trans-national data sharing agreements, enabling and setting up new remote access connections.
Revealing the Impact of Research Data: The Experience from Data Organizations and Initiatives
Finn Dymond-Green (UK Data Service)
Diana Magnuson (University of Minnesota, Institute for Social Research and Data Innovation (ISRDI))
Iratxe Puebla (Make Data Count)
Elisabeth Shook (ICPSR at University of Michigan Institute for Social Research)
Elizabeth Moss (ICPSR at University of Michigan Institute for Social Research)
This panel brings together practitioners who will present strategies for evidencing, promoting, and understanding the impact of research data. Speakers will address unique challenges and solutions in their domain, offering a view of current practices and future directions. Finn Dymond-Green, representing the UK Data Service, will discuss the organization's role as a nationally funded infrastructure connecting data producers with researchers and analysts. Finn will discuss their approaches and channels for evidencing and promoting the impact of data, creating a narrative which builds on and extends beyond data citation. Diana L. Magnuson will present the work of the IPUMS Bibliography team in managing a growing bibliographic database. The challenge for the IPUMS Bibliography team: develop an efficient and effective workflow for managing the expanding bibliographic database and provide a usable web interface for internal and external users to discover and access publications using IPUMS data. Iratxe Puebla from Make Data Count will discuss their work to promote the development of data metrics to enable evaluation of data usage. She will share updates from projects to scale the data usage information available to the community, and to enhance the context of usage measures. This includes the Data Citation Corpus as a central open aggregate for all data citations, as well as a community-led effort to develop a typology of data usage types. Elisabeth Shook and Elizabeth Moss will discuss ICPSR's work in linking research data with scholarly literature through the ICPSR Bibliography of Data-related Literature. They will explore the necessity of human curation, the limitations of persistent identifiers in capturing nuanced data use, and the challenges of integrating meaningful linkages within automated systems. 
Attendees will understand efforts to improve discoverability and impact of research data in an evolving digital environment and gain actionable insights to implement in their own institutions.
SAPRI: advancing marine and polar research through collaborative data management
Anne Treasure (South African Polar Research Infrastructure (SAPRI))
The southern polar region is a system of interconnected physical and ecological components comprising the Antarctic continent, sub-Antarctic islands, the Southern Ocean and deep ocean basins surrounding South Africa. The Antarctic region is hence climatically, ecologically and socio-economically linked to South Africa, and the vast range of disciplines involved requires a holistic approach. The vision of the recently established South African Polar Research Infrastructure (SAPRI) is to facilitate balanced and transformed research across polar disciplines, and to maintain and further expand the world-class, long-term research infrastructure and datasets already established in South African polar and marine research. SAPRI aims to advance this research by fostering interdisciplinary collaboration and efficient data management. By integrating diverse research domains, including the Humanities and Social Sciences (HASS), SAPRI facilitates a holistic understanding of marine and polar environments and their societal impacts. SAPRI addresses the critical need for shared, high-quality data resources to support research in these remote and rapidly changing ecosystems. Data is handled by the SAPRI Data Centre (DC), hosted at the South African Environmental Observation Network uLwazi Node. The SAPRI DC maintains current best practices in relation to its repository management functions and related systems, and applicable community standards for data management are adopted. Through robust data platforms, SAPRI aims to ensure that researchers across disciplines can access, analyse and contribute to data collections that inform pressing issues. SAPRI also supports the human dimensions of polar and marine studies, enabling exploration of cultural, historical and geopolitical perspectives tied to these regions. By bridging scientific and societal concerns, SAPRI aims to promote cutting-edge research and empower informed decision-making for South Africa's marine and polar stewardship.
This collaborative approach underscores the importance of integrating HASS into the traditionally natural science-focused realm of polar and marine research, fostering a comprehensive understanding of these critical global systems.
Book-a-Data-Manager (BADM) RDM Service – A practical approach to RDM support
Naeem Muhammad (KU Leuven)
Veerle Van den Eynden (KU Leuven)
Johan Philips (KU Leuven)
The Book-a-Data-Manager (BADM) is a centrally managed, paid Research Data Management (RDM) support service at KU Leuven, designed for large research projects and groups. The service, comprised of a pool of data managers, offers up to 10 days of consultation to research teams, assisting them in developing and implementing good data management practices. While the data managers, typically drawn from various domain-specific RDM teams, provide the required RDM support and guidance, they are not responsible for managing the data for the research group or project. The research team is ultimately responsible for carrying out and following up on data management. BADM's support focuses on key areas such as developing data storage procedures, improving data organization, documenting data and metadata, and creating workflows for archiving and publishing research data at the end of a project. The service delivers tailored RDM solutions through active engagement between researchers and RDM experts. BADM was launched as a pilot in 2023, addressing researchers' demand for short-term, in-depth support for research data management. During this phase, the service assisted 13 research groups: 6 from Humanities and Social Sciences, 6 from Science and Technology, and 1 from Biomedical Sciences. Feedback from the involved researchers highlighted the value of the solution-driven approach and the comprehensive support provided. Positive evaluations led to BADM's transition from a pilot program to a standard service at KU Leuven in June 2024. The successful operation of a service like BADM requires collaboration among various institutional services and RDM teams, necessitating a robust governance model for optimal execution. This presentation will outline the operational framework of BADM at KU Leuven and discuss both the benefits and challenges of managing such a service within an academic environment.
All hands-on desk – Data stewards to the rescue
Auriane Marmier (FORS)
Since the early 2000s, the open science movement has transformed the research landscape, pushing academics to embrace greater transparency, data sharing, and reproducibility. However, these practices have brought new challenges, particularly in balancing openness with data protection laws and ethical considerations. Researchers now face increasing demands to address issues such as drafting data management plans (DMPs), ensuring data security, and facilitating data reuse. In response, new roles have emerged, with the data steward (DS) standing out as pivotal. DSs support the entire research data lifecycle, from creation and storage to sharing and preservation. Yet, formalized training for DSs remains scarce, with most practitioners coming from diverse backgrounds such as library science or research, without standardized certification. To address this gap, the University of Lausanne launched Switzerland's first Certificate of Advanced Studies (CAS) in Data Stewardship in October 2024. This CAS offers a cross-disciplinary core module, complemented by specialized tracks in life sciences and social sciences. As a key partner, the Swiss Centre of Expertise in the Social Sciences (FORS) naturally contributed to the cross-disciplinary module and took the responsibility of designing and implementing the social sciences module. In this presentation, we will focus on the social sciences module, sharing our experience in designing training tailored to a not yet well-defined profession and participants with different backgrounds and experiences. We will also highlight the challenges encountered during its implementation as well as the benefits that a specialized CAS can offer. We believe that the lessons learned from our pilot can be valuable to practitioners involved in RDM practices, particularly in relation to the new material developed for this data stewardship certification. 
These insights could also guide other institutions in developing training and certification programs, and in shaping the future of data stewardship in research.
June 4, 2025: Concurrent Session B.2
Session theme/info: Data Discovery
Geolinking Service SoRa – Development of the Social-Spatial Science Research Data Infrastructure: FAIR, Smart, Inclusive
Jan Goebel (DIW Berlin / SOEP)
Benjamin Zapilko (Gesis)
Theodor Rieche (IOER)
Jonas Lieth (Gesis)
Stefan Jünger (Gesis)
Alexander Jung (DIW Berlin / SOEP)
The interdisciplinary linking of social and spatial data holds great potential but faces significant infrastructural, legal, and methodological challenges. The SoRa+ project aims to develop a sustainable and privacy-compliant research data infrastructure that bridges this gap, enabling the seamless integration of social science survey data with georeferenced spatial data. Building on established infrastructures such as the IOER-Monitor, ALLBUS, and SOEP, SoRa provides researchers with tools to explore complex, interdisciplinary questions at the intersection of the social and spatial sciences. A key innovation of SoRa is its support for the entire analysis process involving geocoded survey data. This includes data preparation, linking, and analysis workflows, all within a secure environment. The infrastructure's components include a user module, a public/private API, and a Geolinking API. Prototypes for these components have been developed, along with R and Stata packages for facilitating efficient research workflows. Several geolinking methods will be offered, such as point-to-point linking, buffer- or isochrone-based linking, and routing to the nearest Point of Interest (such as a hospital, school or doctor). Additionally, SoRa enables researchers to prepare their analysis by using structural datasets that replicate spatial distributions without exposing sensitive data. These datasets, tailored to specific survey datasets, allow workflow testing and initial exploration before accessing secure environments. This presentation will showcase the project's progress, from defining infrastructure components to developing prototypes and implementing privacy-preserving features. If possible, a live demo will be shown that demonstrates how SoRa can be used to enhance social-spatial research applications.
By advancing FAIR, smart, and inclusive data linking, SoRa sets a robust foundation for interdisciplinary social-spatial research, empowering researchers to address innovative and data-driven questions effectively.
Unlocking the past: increasing discoverability of archive primary sources
Magaly Taylor (Gale)
The value of primary source archives for research and learning is undeniable. Whole periods of our history are concealed in thousands of pages of manuscripts, images, and publications. The transformative power of digitization has significantly increased the availability of this material worldwide, inspiring a new era for instructors and researchers in the humanities, social sciences, and beyond who want to incorporate historical collections into their work. However, interaction with the primary sources' content is still restricted to their host platforms. The challenge is making primary source archive collections more discoverable in the technological landscape of academic libraries, and metadata is the key factor in making this happen. Using the outcomes of the Gale Primary Sources Metadata for Discovery project, this presentation will include an overview of the current metadata standards and communities of practice for archives and metadata for discovery, emphasizing the value and impact on learning and research. It will interest metadata experts, librarians, content providers, and system providers interested in archival primary source content and in playing a role in making it more discoverable. This will be a joint presentation with a representative of the library community.
Specifics of metadata development for qualitative social research
Kati Mozygemba (University of Bremen, RDC Qualiservice)
Noemi Betancort-Cabrera (State and University Library Bremen (SuUB))
Qualitative social research is characterized by a high degree of diversity. The research strategies and heterogeneous methods chosen for the respective research questions result in research data with very different characteristics. In addition, qualitative data often cannot be shared openly due to the sensitivity of its content, which means that the metadata of qualitative datasets is often central to assessing data fit. Metadata helps to contextualize qualitative datasets and creates transparency for researchers interested in reuse. Because qualitative data often contain multiple cases and associated materials, research data centers such as Qualiservice rely on the presentation of micro-metadata that provide information at the data object and case level. FAIR metadata ensures the networking of data objects and metadata with other platforms, the retrieval of datasets in various portals, accessibility for humans and machines, and their interoperability. International standards are central to this. But here, too, there are still gaps in the ability to describe qualitative data in such a way that important information is available for subsequent use. Using the examples of the research data center Qualiservice and the network QualidataNet, we look at the special features and approaches that have been developed to describe qualitative data in a FAIR way. We address three levels: 1) the level of micro-metadata for information at the case and object level, 2) the controlled vocabulary QualiTerm, which is intended to fill gaps in metadata standards for the description of qualitative data, and 3) the level of the DDI-CDI qualitative working group, which aims at the reusability of non-numerical data via the integrative description of data objects across different disciplines and is developed on the basis of use cases.
Session theme/info: Consultation, Literacy, and Librarianship
Don't Reinvent the Wheel: Steering Users to Existing Data Literacy Resources
Maggie Marchant (Brigham Young University)
Sydney Lanning (Brigham Young University)
Whether data services are well established and robust, siloed, or just beginning, a common challenge is awareness. Students and faculty on campus often do not know the full scope of available data resources and services, which limits their opportunities to develop essential data literacy skills. This presentation outlines the process that one university used to centralize marketing and access to existing resources for learning data skills. An audit of current data needs at the university revealed that many students need training in cleaning, analyzing, communicating, and understanding data to be prepared to succeed in future research and employment endeavors. Library and campus support services meet some of these needs and could expand services to provide more training. Before venturing into new services, a full review of existing services and resources was conducted. There were many disparate services for research data support and building data literacy. In particular, semester courses were an under-utilized path for increasing data literacy. These courses already include quality, hands-on data skill training, but awareness is lower because they are spread across different departments and colleges. Courses that teach data skills were identified and visualized in customizable dashboards. This brought the information together and allowed students to direct their own path for gaining data skills, enabling autonomy in the learning process, which improves retention. The data course dashboards were added to a university-branded website along with self-paced, online learning resources for three main types of users—learners, researchers, and decision-makers. The website was designed to be a central place for users to find available data training and resources, simplifying information-seeking and marketing processes.
Going forward, feedback from users and stakeholders will be used to identify unmet data needs that can be filled by additional collaboration between librarians and others on campus.
Enhancing data literacy: the UK Data Service Data Skills Framework
Vanessa Higgins (UK Data Service)
The presentation will introduce the UK Data Service Data Skills Framework for quantitative data skills training (https://zenodo.org/records/11110082). It aims to establish a robust foundation for developing essential data analysis skills for the social, economic and population sciences, focusing on large-scale survey, census, and macro-level aggregate data. It has been developed in the context of a data landscape undergoing unprecedented change at rapid pace. It emphasises continued development of traditional data skills for contemporary research needs, while recognising growing potential for integrating survey or census data with an expanding array of other sources increasingly accessible to social scientists, as well as promising opportunities presented by AI and machine learning for enhancing analysis. The presentation will cover the background and development of the Framework, the methodology, the final content of the framework and the feedback we've had from the community. We will discuss how we are using it within the UK Data Service to aid our own gap analysis of our training programme, training development, and strategic thinking for the next five years, as well as how we are using the Framework to collaborate with other data services. We present a number of potential wider use cases for the framework and hope it can help in the continued development of a relevant and efficient training ecosystem for data analysis across the social sciences. We envisage it as a live and evolving piece of work, and very much welcome discussion on the content, particularly from the IASSIST international audience.
Jen Buckley (University of Manchester - UK Data Service)
The Open Book Project is an initiative that combines open teaching datasets from the UK Data Service with Jupyter Notebooks. The project has two key aims: to create adaptable teaching resources for educators, and to enhance reproducibility skills among students. The first aim is to develop versatile teaching resources that educators can access and adapt to suit their own needs. The second aim is to address the challenge of building reproducibility skills by identifying the competencies students need to create and use tools like Jupyter or Quarto. By detailing these skills and providing supporting resources, the project aims to prepare students for collaborative and transparent research practices. This presentation will explore the project's approach and outcomes.
June 4, 2025: Concurrent Session B.4
Session theme/info: Partnerships and Collaboration
Rescuing the Metastasio Database: a tale of technology, ontology, and Italian Opera
Kristi Thompson (Western University)
Ronan O'Flaherty (Western University)
The Metastasio Database is an index to the numerous librettos, arias, and other poetic works of the prolific eighteenth-century writer Pietro Metastasio; its more than 9,000 records describe and link their diverse musical settings and relationships. It comprises much of the life's work of music professor Don Neville at Western University. In 2001, armed with determination, graduate students, and limited technical and metadata knowledge, he developed a searchable website powered by an MS Access database to allow opera researchers around the world access to this trove of data. In 2023, changing technology and security standards meant that the original site would need to be taken down, and an RDM librarian was asked to find a way to preserve the data and make it available for future generations. After over 20 years of edits and updates, the underlying databases were confusing, highly redundant, and inconsistent. A team including a metadata expert, the music library director, and an RDM librarian worked with a library co-op student to clean and organize the data, and a new website was developed in Omeka to provide access. Key to getting the new database to function was developing a data model and a custom ontology to describe Opera Numeri, the Italian Number Opera form.
Advancing Open Science in Slovenia: The SPOZNAJ Project and Its Role in the Transformation of Research
Sonja Bezjak (Slovenian Social Science Data Archives)
The Slovenian "SPOZNAJ" project aligns with the broader European movement towards open science. In line with European guidelines, Slovenia has enacted key legislation to promote open science: the Scientific Research and Innovation Activities Act (2021), the Resolution on the Scientific Research and Innovation Strategy of Slovenia 2030 (2022), and the Regulation on the Implementation of Scientific Research in Alignment with the Principles of Open Science and the Action Plan for Open Science (2023). These legislative actions require those involved in scientific research to ensure open access to research results, apply responsible evaluation of research work, and engage citizens in research activities. The goal of the SPOZNAJ project is to accelerate the adoption of open science practices among project partners, contributing to the digital transformation of scientific research. The project aims to share knowledge and experiences related to open science and disseminate them to various stakeholders, including public infrastructure institutes, research agencies, privately owned higher education and research institutes, commercial sector researchers, private researchers, and the wider public. In 2024, the SPOZNAJ project consortium delivered training for future data stewards, beginning with the preparation of a Catalogue of Competences for Data Experts, which served as the foundation for a three-week training program. In addition to the training sessions, the project has published two handbooks on Open Science and Research Data Management Planning in Slovenian. A significant achievement of the project is the creation of a Data Stewardship Network, which officially began operations in December 2024. The presentation will address the following challenges: 1) Terminology in Slovenian: Adapting complex open science concepts to the Slovenian language. 2) Lack of Expert Lecturers: Shortage of qualified instructors for open science and data stewardship training.
3) Implementing Open Science Principles: Ensuring the adoption of open science practices across all research-performing organizations in Slovenia.
The UK Longitudinal Linkage Collaboration: the UK's national Trusted Research Environment for the longitudinal research community
Andy Boyd (UK Longitudinal Linkage Collaboration, University of Bristol)
Robin Flaig (UK Longitudinal Linkage Collaboration, University of Edinburgh)
Jacqui Oakley (UK Longitudinal Linkage Collaboration, University of Bristol)
Katharine Evans (UK Longitudinal Linkage Collaboration, University of Bristol)
Richard Thomas (UK Longitudinal Linkage Collaboration, University of Bristol)
Kirsteen Campbell (UK Longitudinal Linkage Collaboration, University of Edinburgh)
Rachel Calkin (UK Longitudinal Linkage Collaboration, University of Bristol)
Abigail Hill (UK Longitudinal Linkage Collaboration, University of Bristol)
Stela McLachlan (UK Longitudinal Linkage Collaboration, University of Edinburgh)
Emma Turner (UK Longitudinal Linkage Collaboration, University of Bristol)
The UK Longitudinal Linkage Collaboration (LLC) is the national Trusted Research Environment (TRE) for the centralised curation and integration of UK Longitudinal Population Studies' (LPS) data and the systematic linkage of participants' routine health, administrative and environmental records. Our objective is to realise new scientific opportunities by creating new combinations of data, to provide efficient access to well curated data and to provide meaningful public safeguards with full transparency of data use. LLC is a secure TRE hosting LPS and linked data within a remotely accessible analysis platform (provided by SeRP). Approved users can analyse integrated individual-level data within a strict governance framework co-developed with LPS data managers and public contributors. LLC has adopted the "five safes" decision-making principles to effectively balance curation requirements (e.g., data citation, processing transparency), security requirements (e.g., de-identification, minimisation and permission management) and efficient and predictable access routes to individual level data. LLC hosts data from >20 partner LPS with >320,000 participants. Centralised data pipelines link participants' NHS health records (primary, secondary and mental health care; prescriptions; mortality and disease registers) and socio-economic records (tax; work and pensions; education). Participant address data is being geo-coded to link environmental, neighbourhood and property data. Our design supports long-term sustainability, linkage accuracy and the ability to link data at both an individual and household level. 
Through establishing LLC as a connected and centralised TRE, the UK's interdisciplinary longitudinal community have "bridged the oceans" that separated LPS data and diverse linked data; "harboured" these sensitive data in a protected and trustworthy environment, and created a data foundation which "anchors the future" for research into priority cross-cutting questions relating to ingrained health and social inequalities, and health-social-environmental interactions, e.g., those driven by climate change. This model extends to federated analysis with international equivalents to enable research at a global scale.
Session theme/info: Ethics, Governance, CARE / FAIR
Synthetic data's role in shaping research and governance – insights from ADR UK projects
Emily Oliver (ESRC UKRI)
Lora Frayling (HDI)
Cristina Magder (UK Data Archive, UK Data Service)
Melissa Ogwayo (UK Data Archive, UK Data Service)
Maureen Haaker (UK Data Archive, UK Data Service)
Hina Zahid (UK Data Archive, UK Data Service)
Jools Kasmire (University of Manchester, UK Data Service)
Robert Trubey (Cardiff University)
Fiona Lugg-Widger (Cardiff University)
Synthetic data is emerging as a transformative tool for enhancing data accessibility. With four expert presentations, we will provide a holistic view of how synthetic data is set to shape the data landscape in the UK. This panel marks the culmination of a series of projects funded by the Economic and Social Research Council (ESRC) via Administrative Data Research UK (ADR UK), which explored the benefits, costs, and utility of synthetic data from the perspectives of the public, data owners/providers and Trusted Research Environment professionals, and researchers. Despite key recognised advantages such as gaining familiarity with data, training, and capacity building, challenges such as public trust, routine adoption, and scalable production persist. With the findings from these completed projects, this panel will provide a comprehensive analysis of synthetic data's potential and challenges. Expert speakers will share their unique perspectives on: • Public engagement: reflections on public understanding and acceptability of synthetic data, with actionable recommendations for building trust and transparency. • Data owners and TRE professionals: insights into best practices and lessons learned in implementing synthetic data collections at scale. • Researchers' perspective: reflections on how synthetic data might be used by researchers, therefore influencing what data, metadata and documentation must be made available for reuse. • Synthetic Data Working Group: outcomes of cross-organisational collaboration, with a focus on the development of standardised terminology in line with FAIR/CARE principles. This session will go beyond presentations, incorporating interactive activities such as live polling and a moderated Q&A. These engagements will enable attendees to apply findings to their own contexts, exchange ideas, and explore pathways for advancing synthetic data adoption globally.
Join us to reflect on the journey of synthetic data, discuss its future potential, and collaborate on strategies to address persistent barriers in research access and governance.
Quality in Qualitative Data Archiving: Lessons Learned from Collaboration Between Data Producers, Project Managers, and Curators
Aubrey Garman (ICPSR)
Rachel Huang (ICPSR)
Vrinda Mahishi (ICPSR)
As the prevalence of qualitative data and mixed-method studies increases at the Inter-university Consortium for Political and Social Research (ICPSR), so does the need to establish best practices for archiving these data while safeguarding privacy and usability for future research. Effective qualitative data archiving requires a forward-thinking perspective at all points in the data lifecycle, which relies upon collaboration among data producers, project managers, and curators. Participants will leave with actionable lessons about how these partnerships improve the quality and accessibility of text-based qualitative data. Drawing on ICPSR staff's collective experience, this presentation will discuss common challenges and practical solutions unique to the qualitative data lifecycle from collection to dissemination, including: (1) creating rich metadata and documentation to improve the usability of qualitative data; (2) addressing privacy concerns through confidentiality safeguards, such as disclosure risk remediation, restricted data access, and output vetting guidelines; and (3) building collaborative workflows that align data producers, project managers, and curators during the archiving process.
Research data management and data sharing of qualitative data
Kati Mozygemba (University of Bremen, RDC Qualiservice)
Research data management (RDM) aims to support researchers in handling their research data. It has become an integral part of research practice and offers numerous services. However, the provided guidance and tools are often generic and the special features of qualitative research, such as the openness of the research process, the sensitivity and density of references and the heterogeneity of the materials, are rarely taken into account. As a result, RDM templates and instruments do not fit in practice and qualitative researchers have difficulties in applying them. To close this gap, QualidataNet - the network for qualitative data (www.qualidatanet.org) provides a research data management portfolio for qualitative research. QualidataNet is a service developed and operated as part of the National Research Data Infrastructure in Germany in the Consortium for Social, Behavioral, Educational and Economic Data (KonsortSWD-NFDI4society). The presentation looks at the special features of RDM of qualitative data and shows ways of addressing these in practice. The special handling of data protection and copyright issues in qualitative social research is considered as well as special features of the research data life cycle for qualitative research. We address how qualitative data can be anonymized and documented so that as much social science-relevant information as possible is preserved for subsequent scientific use. Guided by the idea of an ethical RDM, the portfolio aims to enable researchers to ethically reflect on their actions in the field. It is clear that, in addition to solutions that take into account the openness and heterogeneity of qualitative data, the flexibility of the instruments, ethical reflection skills and dataset- and project-specific aspects are key to supporting qualitative researchers in generating FAIR research data that can be used sustainably.
Lessons learned from collecting and managing an online communication dataset from right-wing extremist actors
Christina Dahn (GESIS - Leibniz-Institute for the Social Sciences)
Katrin Weller (GESIS - Leibniz-Institute for the Social Sciences)
Data from large online communication platforms enable researchers to study a variety of specific communication settings. But archiving and sharing data used for this kind of research is a challenge of its own, and it becomes even more challenging for particularly sensitive data. We present a use case for creating a dataset from the online communication of German and Austrian right-wing extremist actors. While many studies focus on a specific platform, powerful actors in the right-wing scene are connected by an online ecosystem encompassing multiple platforms. Often, hateful content is linked to on other platforms, and actors tailor their language according to platform affordances (e.g., the level of content moderation). We have created a cross-platform dataset with data from Telegram and YouTube that is based on a curated list of Telegram channels by Austrian and German right-wing extremists and that also includes the outgoing links from these actors' Telegram posts to the mainstream media platform YouTube. We will describe details of our procedures for planning, collecting, managing, annotating, and archiving the dataset. This includes lessons learned from seeking ethics approval. It also includes reflections on strategies for collecting data during a time of restricted public access to online platform data, and for managing the data during the collection process. It furthermore includes reflections on protecting researchers involved in the data collection, including student assistants who are helping with data annotation to prepare data for analysis. In order to be able to archive and share the dataset, we had to face the trade-off between ethical and legal concerns and data quality, resulting in several limitations of the final dataset. We summarize aspects that we consider useful as guidance for other work in this area and that can be transferred to other cases of sensitive content from online platforms.
June 4, 2025: Concurrent Session C.2
Session theme/info: Data Discovery
A data services hierarchy of needs: how should we build a National Data Library?
Richard Welpton (UKRI: Economic and Social Research Council)
Kirsten Dutton (UKRI: Economic and Social Research Council)
For good reasons, numerous investments have been made in different infrastructures in the UK to support often specialist, discipline-led discovery, access and use of different data resources. At the same time, the data infrastructure landscape has become busy and fragmented. Yet the use of data created from a range of sectors now regularly straddles traditional academic topics. The pandemic highlighted the importance of ready access to a range of linked economic, environmental, health and social data. No single source of data can absolutely answer a research question (if it ever could). Despite the increasing availability of data, our Future Data Services (FDS) review describes how finding and accessing data is now more complex for the researcher. How should the existing data infrastructure landscape evolve to smooth this experience for researchers? The new UK government's manifesto referenced the creation of a 'National Data Library' (NDL). Can this provide a solution to the fragmented landscape described above? How might it build upon the infrastructures we have been investing in for years? How could it address the challenges that our review documented, such as the need for closer connections between existing infrastructures that support the federation of data services? We describe a progressive approach to data infrastructure development, modelled as a 'hierarchy of needs'. Assessing the core components to make data discoverable and accessible allows us to evaluate additional services, and the need for greater innovative capacity and maturity. From this assessment, a vision for an NDL becomes apparent. Our model allows us to frame our ambition for a federated data service landscape and helps us monitor our progress to achieving this goal. The aim is to simplify the journey for researchers. After all, "the only thing that you absolutely have to know, is the location of the library." (Einstein).
Data-driven (data) service improvement, the UKDS way
John Sanderson (UK Data Service)
Leigh Tate (UK Data Service)
As a large-scale data provider, the UK Data Service generates a significant amount of data about the users we support and their transactions with us. The demands on our service have grown significantly over the last few years, and as such we have had to meet the challenge of providing an effective service without the benefit of additional resource. By using our own administrative data we have been able to respond effectively, resolve problems as they have occurred, and find new opportunities for improvement. This presentation will highlight several worked examples where utilizing our own customer interaction data enabled us to understand our own activity better, improve our service delivery approaches, and provide a better service to data users.
Scoping future library data services at the University of Sheffield (UK)
Holly Ranger (University of Sheffield)
This paper presents the methods and findings of a scoping study conducted by members of the Faculty Engagement and Scholarly Communications Teams at the University of Sheffield (UK) to inform proposals for future library data services. Taking account of the library's ambitions to facilitate both human and machine access to its open and subscription data, the project gathered information on sectoral best practice, user needs, and existing cross-university data services provision to identify potential services, to identify the resourcing required to realise those ambitions, and to harmonise the library's offering with existing services. The project engaged primarily in desk research but also conducted qualitative interviews with researchers to understand their research data needs and to understand the technical and contractual barriers they had encountered when attempting to access subscription data to which they had lawful access. While the project began with a narrow focus on realising the library's ambitions for collections-as-data, we found that researchers repeatedly raised the value of access to librarians themselves as archivists, copyright specialists, and systems and metadata experts. In our vision for future library data services, facilitating opportunities for research collaborations with our people will be as important as facilitating access to our collections.
Session theme/info: Consultation, Literacy, and Librarianship
Sensitive Data, Smarter Training: Implementing Asynchronous Canvas Modules for Data Safety Training
James Capobianco (Harvard University)
James Adams (Harvard University)
Megan Potterbusch (Harvard University)
At Harvard University, researchers working with sensitive data are required to undergo specialized training. For Harvard Kennedy School Master's in Public Policy students, this is particularly relevant, since their capstone projects often involve original data collection or managing sensitive information. In the past, librarians conducted multiple in-class training sessions for these projects. However, as the volume of these demands has grown, this model has become unsustainable. To address these challenges, our team transitioned to asynchronous Canvas modules during the current academic year. These modules cover essential topics, including identifying and classifying sensitive data, understanding global privacy regulations, and applying Harvard's data security framework. This shift not only ensures consistent, high-quality training but also enables the Library and Research Services team to support larger and more diverse cohorts of students without increasing staffing levels. This presentation will detail the development, implementation, and preliminary outcomes of these modules, which were piloted with variations across six capstone seminar groups. We will discuss the collaborative process of engaging with faculty seminar leaders and coordinating with university stakeholders, as well as the challenges encountered. Additionally, we will highlight how the transition to asynchronous training has freed librarians to focus on providing individualized support for the most complex data safety issues.
Implementing Moodle for training at the UK Data Service
Sarah King-Hele (UK Data Service)
The UK Data Service is developing a Moodle platform for delivering and managing online learning materials created by our training teams across the service. These materials focus on developing foundational data skills for finding and using the social survey, census and other data available from the UK Data Service, for managing research data and for learning computational social sciences skills. This presentation will share the journey of planning and implementing our Moodle, focusing on its use cases and the key considerations involved in aligning the system with the UKDS's needs.
Not dirty, not clean: the language of making changes to your data is a literacy issue
Carol Choi (New York University)
Specialization often leads to - if not necessitates - the coining and use of technical terms and jargon specific to a discipline. Many fields engage in data science and computational research, and yet there isn't a universal, shared language for one of the most fundamental steps of working with data: some refer to it as cleaning, some are merely tidying, others wrangle, scrub, process, munge, manipulate, or transform. These terms refer to everything from correcting encoding errors and typos to large-scale content moderation, from normalization of results to altering data for nefarious purposes. And because data can pass through many hands and be used for a range of studies in different fields, these changes can have significant consequences, both immediate and far-reaching, for methodologies, analyses, reproducibility, and much more. Some of these terms have already been identified as problematic – in particular, "data cleaning" and the related term, "tidying" – by data feminists, information scientists, and others. And in at least one discipline, a case has been made to assign specific (arguably more accurate) meaning to it which has not carried over to other fields. The lack of precision of these terms also obscures the reality of working with data in research; not only is "clean data" an inappropriate misnomer, but its use negatively affects expectations and belies the importance of the labor it refers to and its potential repercussions. Drawing on the work of and case studies from law, business, and economics in addition to those listed above, this presentation will take a multi-disciplinary look at what these terms refer to and the implications of the language used, and will argue that a universal approach is a critical data literacy issue.
Session theme/info: Partnerships and Collaboration
Creating accessible data training for postgraduate students: a collaborative journey
Jen Buckley (UK Data Service, University of Manchester)
Alle Bloom (UK Data Service, University of Manchester)
To address the growing demand for accessible data skills among postgraduate researchers, the UK Data Service partnered with three providers of postgraduate training for the social sciences (Doctoral Training Partnerships (DTPs)) to develop the online course 'Introduction to Finding and Using Data'. The goal of this project was to create high-quality data training that is accessible to all postgraduate students, regardless of discipline or background. The collaborative process involved consultation with the training providers to determine the content and structure of the online modules. We also trialled the modules in an initial pilot phase, where we recruited student consultants to participate in focus groups tasked with recommending changes to the course design. The valuable (but sometimes contradictory) feedback from students informed a redesign of the course. The presentation will discuss the benefits and challenges of this collaborative and iterative course design approach. It will also cover the challenges encountered in designing an accessible course for a diverse student body and showcase the final product.
Bridging cross-regional data instruction needs: Collaborative approaches to data services education
Christine Nieman Hislop (University of Maryland, Baltimore)
Justin de la Cruz (NYU Langone Health)
The Network of the National Library of Medicine (NNLM) is a cross-institutional, collaborative organization funded by the National Library of Medicine to do outreach and education across the entire United States. Seven regional libraries (headquartered out of separate academic institutions) are tasked with engaging organizations in their regions, and – through coordination with three national offices and three national centers – providing education on a national scale around health sciences topics, including health data. While NNLM regions have historically worked independently to provide data services education to their state-designated constituents, data education needs are similar across geographic regions. With the formation of the NNLM National Center for Data Services in 2021, data professionals collaborated cross-regionally to combine efforts in this area. One example was hosting programming workshops from The Carpentries, which progressed from organizing 1-2 regionally-funded workshops each year to funding and coordinating 10 centrally-organized Carpentries workshops with regional and national focus, all hosted within one year. In response to reduced enrollment and training evaluation feedback, we reassessed our approach and pivoted from full-scale Carpentries workshops to adapting existing Carpentries curriculum materials (which are open source) to develop targeted training for a health sciences audience. This presentation will cover collaborations across the NNLM to provide data services education on a national level and how these educational offerings have evolved in response to regional and national assessments. This includes scaling classes, adapting educational materials, assessing programs, and gathering qualitative feedback from participants. 
Attendees will learn tips for meeting data instruction needs cross-regionally through collaboration and innovation, and presenters will discuss ideas for responding to emerging needs in data professional development education.
Exploring the Collaborative Landscape of GIS and Data Services in a Graduate School of Design
Bruce Boucek (Harvard University GSD)
Librarians and other academic staff who provide GIS and data services are sometimes treated as unicorns or as invisible, siloed, and isolated experts. This presentation seeks to dispel this impression by exploring the collaborative landscape necessary for providing geospatial (GIS) and data services at Harvard University's Graduate School of Design (GSD). The intent of this presentation is to illuminate and elevate the collaboration that is necessary to provide the expertise and support required by GSD students and faculty to do scholarship and design work that meets their standards. An outline of the scope of responsibilities of the GIS, Data, and Research Librarian (GDRL) at the GSD will be provided as a framework for identifying the many different types of collaboration that are necessary to accomplish each task. Workshops, class sessions, and one-to-one and small-group consultations all require a broad range of resources. Relationships with, and expertise provided by, many different units in the school, in Harvard Library, and across Harvard University will be identified and explored. The GSD's Information Technology group provides the computing infrastructure for the GSD community; the GSD's FabLab makes it possible to transform data into physical objects to be inspected visually and tactilely. Harvard's Center for Geographic Analysis licenses software, data, and software-as-a-service tools. Harvard's Map Library provides digitized data, historical map collections, and scanned map collections. Harvard Library also provides vast collections and general digital scholarship expertise. These are all examples of the collaborative landscape necessary for a GIS, Data, and Research Librarian to excel in their work. Perhaps the experiences of one individual at one institution can provide applicable lessons for other experts and their institutions.
June 4, 2025: Concurrent Session C. Panel
Session theme/info: Timely Topics
Back to the rough ground! Retrieving concepts in survey research and its potential uses
Deirdre Lungley (UKDS, University of Essex)
Suparna De (University of Surrey)
Chandresh Pravin (University of Surrey)
Jon Johnson (CLOSER, University College London)
Paul Bradshaw (ScotCen)
Survey designers and those fielding questionnaires invest significant effort in asking the right questions to elicit high-quality data from respondents. Yet for a researcher coming to data from archives, much of this information is lost or locked up in PDFs that are burdensome to use and a barrier to the ambitions of FAIR. The technical capability to serve up such metadata is well supported by standards such as the suite of DDI standards. Populating such schemas at scale will, however, need a step change in the way metadata is utilised in the data lifecycle. The absence of high-quality question banks and the paucity of 'this is how you do it' projects are demotivating factors for adoption. The ESRC Future Data Services pilot project between CLOSER, the University of Surrey, the UK Data Service and ScotCen is tackling these issues, utilising the CLOSER metadata repository as a training (meta)data set to develop novel machine learning approaches to the extraction of metadata from survey questionnaires, conceptual extraction and alignment of questions, and the use of concepts to drive machine-actionable disclosure assessment. The presentation will report on progress in these three areas.
Takeaways from 'UKDS PRUK - Skills Development for Managing Longitudinal Data for Sharing'
Finn Dymond-Green (JISC, UK Data Service)
Beate Lichtwardt (UK Data Archive, UK Data Service)
Cristina Magder (UK Data Archive, UK Data Service)
Population Research UK (PRUK) is a new national resource, which is maximising the potential of Longitudinal Population Studies (LPS) data by bringing together and developing the infrastructure, processes and people that will enable LPS data to be efficiently enhanced, accessed and analysed. ESRC and MRC, part of UK Research and Innovation (UKRI), commissioned activities to inform and help address some critical challenges and opportunities facing the LPS community in preparation for the PRUK launch. 'Skills Development for Managing Longitudinal Data for Sharing' was one of these initial projects. The project aimed to increase and expand the skills and expertise available to support the wider sharing and increased use of LPS data resources. It developed the understanding of biomedical-specific data sharing barriers, built on existing knowledge and led to the preparation of further training materials to be used within the LPS community across the social and biomedical sciences. Our presentation summarises the lessons learned during this 18-month project and highlights the impact and takeaways of this initiative. It reflects the UK Data Service's response to a call to action from funders, offering insights into how to expand and improve training provision for managing and sharing LPS data in alignment with Research Data Policies.
Innovative uses of Digital Object Identifiers (DOIs) at UK Longitudinal Linkage Collaboration to aid curation and interpretation of a complex data integration and a dynamic population sample
Katharine Evans (UK LLC, University of Bristol)
Richard Thomas (UK LLC, University of Bristol)
Rachel Calkin (UK LLC, University of Bristol)
Abigail Hill (UK LLC, University of Bristol)
Emma Turner (UK LLC, University of Bristol)
Andy Boyd (UK LLC, University of Bristol)
UK Longitudinal Linkage Collaboration (LLC) is the national Trusted Research Environment (TRE) for the UK's longitudinal research community. LLC integrates data from many UK Longitudinal Population Studies (LPS) and systematically links participants' health, environmental and socio-economic records into a centralised TRE. The breadth and volume of data, the many different data owners, and the complexity and dynamic nature of the LLC pooled sample require innovative and flexible data curation approaches to ensure LLC supports FAIR data principles. LLC's model encompasses a wide range of components, each of which will be assigned a DOI that will resolve to the LLC Guidebook (https://guidebook.ukllc.ac.uk) or LLC's Data Use Register. With respect to the data in the TRE, DOIs will be assigned according to the following hierarchy: (1) individual data files; (2) logical groupings of data files, e.g. those collected at the same timepoint; and (3) data owner, e.g. an LPS. In addition, a DOI will be assigned to each 'data freeze'. Currently, the LLC sample includes 323,775 participants from 20 LPS, but the sample will change over time as new participants or new LPS join and others withdraw. Following quarterly updates from LPS, LLC establishes a 'data freeze', which comprises information summarising the broad characteristics of the LLC sample at that point in time. Lastly, a DOI will be assigned to each approved project, reflecting the unique (and seldom-shared) combination of provisioned data and the related data freeze (the project DOI will encompass all 'child' DOIs). This will promote reproducibility, aid inference and permit straightforward citation. A complex data scenario at LLC is driving innovative uses of DOIs, with an expansion to include additional components likely. A priority is to ensure clarity to users, in particular how the LLC approach aligns with the way DOIs and data are structured in other resources.
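The layered DOI scheme described in the abstract can be sketched as a simple nested structure. This is an illustrative model only: the class names and example identifiers below are hypothetical, not LLC's actual implementation, and `10.5072` is the DataCite test prefix.

```python
from dataclasses import dataclass, field

# Hypothetical model of the DOI hierarchy the abstract describes:
# individual data files -> logical file groupings -> data owner (an LPS),
# plus a 'data freeze' DOI and a per-project DOI that encompasses all
# 'child' DOIs, supporting reproducibility and straightforward citation.

@dataclass
class DataFile:
    doi: str

@dataclass
class FileGroup:  # e.g. files collected at the same timepoint
    doi: str
    files: list[DataFile] = field(default_factory=list)

@dataclass
class DataOwner:  # e.g. a Longitudinal Population Study (LPS)
    doi: str
    groups: list[FileGroup] = field(default_factory=list)

@dataclass
class Project:  # an approved project: provisioned data plus a data freeze
    doi: str
    freeze_doi: str
    owners: list[DataOwner] = field(default_factory=list)

    def child_dois(self) -> list[str]:
        """All DOIs this project DOI encompasses (freeze, owners, groups, files)."""
        dois = [self.freeze_doi]
        for owner in self.owners:
            dois.append(owner.doi)
            for group in owner.groups:
                dois.append(group.doi)
                dois.extend(f.doi for f in group.files)
        return dois

# Example identifiers, all invented for illustration:
lps = DataOwner("10.5072/lps-1", [
    FileGroup("10.5072/lps-1.wave-2", [DataFile("10.5072/lps-1.wave-2.file-a")]),
])
project = Project("10.5072/project-42", "10.5072/freeze-2025q1", [lps])
print(project.child_dois())
```

Resolving a project DOI to its full set of child DOIs is what makes the provisioned data for a given study precisely citable, even as the underlying pooled sample changes between freezes.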
In the Moment Costs: A Longitudinal Study of Data Management & Sharing
Alicia Hofelich Mohr (University of Minnesota)
Joel Herndon (Duke University)
Cynthia Hudson Vitale (Johns Hopkins University)
Jacob Carlson (University at Buffalo)
Shawna Taylor (Association of Research Libraries)
Despite the growing importance of including data management and sharing costs in research budgets, estimating these specific costs remains difficult. Many grants submitted to the US National Institutes of Health (NIH) since the 2023 Data Management and Sharing Policy included $0 or no specification of data management and sharing costs, despite specific guidelines to include them. Previous research conducted by the Realities of Academic Data Sharing (RADS) initiative provided initial estimates of these costs based on a retroactive survey of researchers who completed grants with data sharing requirements within the last decade. However, specific financial information was difficult for many participants to recall, and further refinement and assessment of data management and sharing costs are needed. This presentation will describe follow-up research being done by the RADS initiative to refine these estimates. In this project, three faculty who were awarded grants under the 2023 NIH Data Management and Sharing Policy are followed and surveyed longitudinally throughout their projects. We will present results from the first four meetings, which align with the approximate halfway point of each grant. None of the projects included a specific budget for data management and sharing in their grant applications. However, after two meetings, a year and a half into the grants, PIs reported that both they and their collaborators were doing data management activities, spending between 4 and 1,200 hours on data management and sharing activities, with an average cost of $46,000 in time alone.
Financial Data Stewardship: A Librarian's Approach to Data Discovery and A Call for Coding as a Path to Data Comprehension and Verification
Lip Hwe Tee (Singapore Management University Libraries)
Datasets, especially financial datasets, can be hard to understand. Among many complexities, understanding the data schema is particularly important for extracting the required data correctly. This presentation advocates coding as a means to understand datasets: to interpret table structures and relevant fields correctly, extract the right data, and ensure accuracy and relevance. An example or two of how to work with WRDS data will be used to illustrate the importance of this premise in understanding dataset contents and relationships, to support data extraction and uphold data integrity in service of learning and research. Using a case study on retrieving historical S&P 500 membership data from the WRDS database, this presentation illustrates how coding enhances comprehension of dataset intricacies and improves research and data extraction outcomes. The article "Notes and Thoughts on Retrieving Historical Members of S&P 500 from WRDS" provides a clear data learning path and a walkthrough for querying historical members of the S&P 500 from WRDS with Python code. The incremental approach taken helps learners understand both the coding process and the dataset structure. Data extraction is only as true as the data comprehension behind it, particularly in the domain of financial datasets. In a context where librarians are increasingly seen as collaborators in research rather than just custodians of information and data, the role of librarians as active data interpreters through coding is highly valued. This presentation aims to inspire librarians to adopt coding as a tool not just for extraction but for critical engagement with datasets, underscoring the importance of computational literacy in bridging gaps in dataset discovery and ensuring accurate, meaningful data retrieval for academic and professional endeavours.
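The schema insight at the heart of this case study is that historical index membership is stored as date *intervals* (one row per stock with entry and exit dates), so retrieving point-in-time constituents means filtering rows whose interval contains the query date. The sketch below illustrates that logic on an invented toy table mimicking that kind of schema; it is not the presenter's actual WRDS query, and the identifiers are made up.

```python
import pandas as pd

# Toy membership table: one row per stock (permno) with the date
# range during which it was an index member. Data invented.
membership = pd.DataFrame({
    "permno": [10001, 10002, 10003],
    "start":  pd.to_datetime(["1995-01-01", "2000-06-15", "2010-03-01"]),
    "ending": pd.to_datetime(["2005-12-31", "2024-12-31", "2012-08-20"]),
})

def members_on(df: pd.DataFrame, date: str) -> list[int]:
    """Point-in-time membership: which stocks were in the index on `date`?

    A naive equality match on a date column returns nothing here;
    correct extraction requires an interval-containment filter.
    """
    d = pd.Timestamp(date)
    mask = (df["start"] <= d) & (df["ending"] >= d)
    return sorted(df.loc[mask, "permno"].tolist())

print(members_on(membership, "2011-01-01"))  # [10002, 10003]
print(members_on(membership, "1996-01-01"))  # [10001]
```

In practice one would run the equivalent SQL against WRDS (e.g. via the `wrds` Python package's `Connection.raw_sql`), but the comprehension step the abstract argues for is exactly this: knowing that the table encodes intervals before writing the query.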
Data Availability in Teaching Effectiveness Research
Talha Sajjad (DIPF | Leibniz Institute for Research and Information in Education)
Johannes Hartig (DIPF | Leibniz Institute for Research and Information in Education)
Thomas Lösch (DIPF | Leibniz Institute for Research and Information in Education)
Carmen Köhler (DIPF | Leibniz Institute for Research and Information in Education)
Sharing research data promotes transparency and cumulative knowledge building, which is particularly important in fields where data collection is resource-intensive. One such field is teaching effectiveness research, which examines how teaching quality impacts student learning. Drawing valid conclusions in teaching effectiveness research requires classroom-level data involving both students and teachers, measurements of teaching quality and learning outcomes, and a longitudinal design. Despite its potential benefits, earlier studies reported wide variation in data availability, with data being accessible for 10% to 50% of published research articles. Our study investigates data availability in teaching effectiveness research by assessing access to datasets from a sample of published research articles. We sampled articles from meta-analyses in the field, identifying 167 studies that met our inclusion criteria: longitudinal design, peer-reviewed, primary data, and publication after the year 2000. For each article, we checked for published datasets. When no public dataset was available, we contacted corresponding authors to request access. Only 13 publications included links to published datasets. Among contacted researchers, one-third had outdated contact information. Of the remaining researchers, 32 responded, with 9 agreeing to share their data. The overall availability rate was 13.2%, with the odds of data-sharing decreasing by 17% annually post-publication. Data availability in this applied research field was lower than the approximately 25% availability reported for basic research in experimental psychology. These findings highlight the importance of archiving data in repositories, as availability upon request remains limited.
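The reported overall availability rate is consistent with the counts given in the abstract (13 publications with published datasets, plus 9 obtained from authors on request, out of 167 eligible studies):

```python
# Counts taken directly from the abstract.
total_studies = 167
published_datasets = 13
shared_on_request = 9

available = published_datasets + shared_on_request  # 22 studies
rate = available / total_studies
print(f"{rate:.1%}")  # prints 13.2%, matching the reported rate
```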
Session theme/info: Consultation, Literacy, and Librarianship
Evaluating Data Sharing Practices in Top Social Science Journals: Insights and Implications
Jiebei Luo (New York University)
The rise of data-driven research and the open data movement, supported by initiatives like the FAIR Principles, TOP Guidelines, and DA-RT statement, has significantly advanced journal policies on data sharing, reproducibility, and citation in the social sciences. Building on the study by Crosas et al. (2018), which analyzed the top 50 ranked journals across six social science disciplines, this project broadens the scope by examining the data policies outlined in the author guidelines of the top 10% of journals within each sub-discipline of the social sciences. Using the 2023 Journal Impact Factor (JIF) from the Web of Science, the study includes 814 journals spanning 49 social science disciplines. This presentation examines changes in data sharing and availability policies in top social science journals, focusing on key aspects such as requirements for data sharing, linking, references, and statements. It evaluates policy compliance across social science disciplines and discusses the implications for librarians, particularly those in data services units.
The Proliferation of Predatory Journals and Institutional Responses in Ethiopia
Mohammed Yimer (Woldia University)
The paper examines the proliferation of, and propensity for, publication in predatory journals by academics in institutions of higher education in Ethiopia. Predatory journals are known for collecting money at the expense of scientific knowledge. The study consulted authors and their published articles from seven purposively selected universities. The study found that over 89 percent of the articles were published in predatory journals. If this trend continues, it may foster incompetent research practices and a proliferation of poor-quality research, which would tarnish the reputation of academia in Ethiopia and hamper the development endeavours of the country. This phenomenon may also result in knowledge loss and ruin the prestige of institutions. Limiting the scope of databases in which journals must be indexed and accredited to Web of Science and ProQuest may help to address this scourge significantly.
June 4, 2025: Concurrent Session D.4
Session theme/info: Partnerships and Collaboration
Laying the Foundation: A Pilot National Research Data Bootcamp
Jennifer Abel (University of Calgary)
Jane Fry (Carleton University)
Nick Rochlin (University of Victoria)
Mathew Vis-Dunbar (University of British Columbia, Okanagan)
Current research data management (RDM) training in Canada is largely based around 'one-shot' sessions focused on Data Management Plans (DMPs) and data deposits, and often misses key concepts and activities of computational reproducibility that are essential to maintaining data integrity throughout a project's lifecycle. While there are other, more computationally focused workshops offered nationally, feedback from participants suggests that these training sessions can be overly advanced and fast-paced for those new to these technologies. To begin addressing this gap in data training, librarians and graduate students from four Canadian universities, supported by funding from the Digital Research Alliance of Canada, will pilot a week-long, 20-hour national data bootcamp aimed at graduate students and early career researchers with little or no computational background. The bootcamp, which will be delivered in May 2025, will walk participants through the research data lifecycle using a mock project, and focus on connecting research data management best practices (documentation, storage, sharing, and preservation) to best practices in scholarly integrity (registrations and computationally reproducible workflows), which are intricately connected but often presented as separate practices. In addition to delivering the bootcamp, a key output will be robust and openly accessible asynchronous training materials for each session of the bootcamp. The broader vision for this program is to develop additional bootcamps in the coming years, which would aim to define "introductory", "intermediate", "advanced", and "expert" level training, and provide a seamless educational trajectory in data and computational training. This presentation will begin by introducing the context in which this pilot was developed, including gaps and opportunities in the Canadian training landscape.
We will then discuss the process of developing and delivering the program, and will conclude by addressing lessons learned and future directions.
Bridging evaluation and implementation: Using results from a survey of research data repository administrators to anchor community-driven initiatives
Meghan Goodchild (Queen's University)
Alisa Rod (McGill University)
Julie Shi (University of Toronto)
Two years have passed since we launched a Canadian survey of Borealis research data repository administrators. The goal of the survey was to identify gaps in the data repository services landscape that might be collaboratively addressed by a national community of research data management librarians, data specialists, data repository managers, and infrastructure providers. This presentation will focus on how results of this survey have helped to steer the launch of new community-driven initiatives. Three primary gaps identified by the survey include capacity or support for developing curation models, preservation planning and workflows, and support for sensitive data deposit in the context of limited staffing capacity. Regarding curation and preservation, we will discuss how the survey results supported the relaunch of a community initiative to update and develop new resources and documentation for Borealis institutional research data repositories seeking to apply for CoreTrustSeal (CTS) certification or to benchmark their services. CTS requirements mandate specific levels of preservation and curation activities that align with gaps identified in our survey results. Borealis-specific resources on how members of our national shared research data repository infrastructure may implement service models to meet CTS requirements also provides guidance on the resources and capacity required for harmonizing curation models to international standards. We will also discuss how the survey results have helped to shape the work of a collaborative group of librarians and data repository administrators aiming to draft guidelines and a checklist for sensitive data deposit as contextually defined by a combination of institutional and national policies, regulations, and frameworks. For example, we will discuss standardizing the guidelines to common levels of risk related to research data. 
This presentation will also address variations in institutional-level staffing models and how we plan to use a longitudinal survey design to track shifts in readiness and capacity over time.
Building a community around DDI: The European DDI Users Conference
Jon Johnson (CLOSER, Social Research Institute, UCL)
Joachim Wackerow (Independent Consultant)
Mari Kleemola (Finnish Social Science Data Archive, Tampere University)
The annual European DDI Users Conference (EDDI) was established in 2009 to bring together users of the DDI standards, to exchange ideas, and support capacity building loosely based around the IASSIST model. The DDI (Data Documentation Initiative) standards are open standards for describing and managing data from social, demographic, economic and health sciences. This presentation will outline the development of the Conference, and how it has evolved to reflect the changing needs of data producers, managers and disseminators. The presentation will also discuss the challenges of bringing together a diverse community of metadata and data producers and users, where EDDI has succeeded and where there is room for improvement.
Session theme/info: Ethics, Governance, CARE / FAIR
How are data archives modernizing their platforms for FAIRness and sustainability?
Margaret Levenstein (University of Michigan, Institute for Social Research, ICPSR)
Steven McEachern (UKData)
Jeannette Jackson (Institute for Social Research)
Panelists: Maggie Levenstein (Director, ICPSR), Steve McEachern (Director, UKData), and tbd (hopefully Sikt (Norway)). Moderator: Jeannette Jackson (Managing Director, Research Data Ecosystem, ICPSR). Explore how ICPSR and UKData are modernizing their software platforms to ensure data is FAIR and to enable researchers to remain at the forefront of social science research, making new scientific discoveries possible. Importantly, modernizing these software platforms helps future-proof these institutions, ensuring they are flexible and extensible enough to support different data types and advancing interconnected social science research. Additionally, we will share how this modernization contributes to the long-term sustainability of data archives. The panel structure will include a 10-15 minute presentation by each panelist (depending on whether there are 2 or 3 panelists). During their presentations, panelists will discuss their current user and technical challenges, the steps they are taking to address these challenges through software modernization, and how this work contributes to the organization's long-term sustainability. Following the presentations, there will be a 15-minute moderated discussion among the panelists, focusing on shared challenges and opportunities in infrastructure and service development, approaches to data deposit and cataloging methods, and strategies for curation workflows, particularly concerning standardization or innovation in curation processes. The session will conclude with a 15-minute audience Q&A.
June 5, 2025: Concurrent Session E.1
Session theme/info: Data Management
Assessing data management and sharing plans: The "state of play" at Duke and opportunities for cross-campus collaborations
Sophia Lafferty-Hess (Duke University)
Jenny Ariansen (Duke University)
Jen Darragh (Duke University)
William Krenzer (Duke University)
Over the past few years, the United States has implemented a second round of data management policies, exemplified by the 2023 NIH Data Management and Sharing Policy and 2022 "Nelson Memo." Effectively supporting public access to data and a data sharing culture at an academic research institution requires collaboration across various research support staff and central offices as well as knowledge of the current practices of researchers. Two research support groups at Duke University, the University Libraries (DUL) and the Office of Scientific Integrity (DOSI), have forged a strong working relationship for supporting data management and sharing practices, including an active Teams channel for communication, developing tools collaboratively, delivering trainings, and providing co-consults for data management. To more effectively understand "the state of play" at our institution, DUL and DOSI worked on a project analyzing the content of data management and sharing plans (DMSPs) submitted to the National Science Foundation (NSF) in 2021. The project team used a modified version of the DART rubric (https://osf.io/qh6ad/) to score DMSPs against required elements in key areas, including types of data; standards for data and metadata; access, sharing, and preservation; limitations on access, distribution, and reuse; and roles and responsibilities. In this presentation we will present the key findings from the DMSP assessment project and discuss how, as data management specialists, we can use this information to plan for ongoing education, training, and resource development using a cross-campus collaboration model. Likewise, we will reflect on the formation and maintenance of these types of working relationships between libraries and research support offices.
Harboring an ocean of synthesized knowledge: Health sciences librarians' perspectives on depositing knowledge synthesis search strategies in research data repositories
Daniela Ziegler (Centre hospitalier de l'Université de Montréal)
Health sciences librarians in Canada have been increasingly depositing and sharing knowledge synthesis (KS) search strategies and database exports via research data repositories. By defining search strategies as the code that extracts the data contained in database exports, we can expand the mandate of research data management (RDM) infrastructure to include this work (Rod and Boruff, 2024). To better support these initiatives, it is important to understand librarians' perspectives on their comfort and experience in using research data repositories to deposit KS work for long-term preservation and reuse. This presentation reports the results of a survey of Canadian health sciences librarians' perspectives on the RDM aspects of KS search strategies and the potential barriers they may face. We invited 498 individuals to participate if they were listed as a health sciences librarian or specialist on the public websites of academic, hospital, government, or special libraries in Canada. We received 128 submitted responses for a 25.7% response rate. A large majority of participants (84.8%, n = 125) agreed or strongly agreed that "search strategies and their related output files are the equivalent of research data and code for a [KS] publication". A majority also agreed or strongly agreed (59%, n = 127) that KS projects should have data management plans (DMPs) and that depositing search strategies facilitates open science (93.8%, n = 128). More than half of the participants also reported being somewhat or very concerned about their knowledge level of RDM (56.7%, n = 127) and repository platforms (51.9%, n = 127). Encouragingly, more than half of the participants indicated interest in increasing their skills in a variety of RDM-related topics (e.g., DMPs, data deposit, etc.). We will discuss how the survey results could inform collaborations to develop best practices for health sciences librarians integrating RDM activities into KS research.
We're gonna need a bigger boat: Scaling up repository support for larger data
Michael Shensky (University of Texas at Austin)
Courtney Mumma (Texas Digital Library)
Laura Sare (Texas A&M University)
Robert Kalescky (Southern Methodist University)
Millicent Weber (Baylor University)
Bryan Gee (University of Texas at Austin)
As technologies and computational methods continue to improve, researchers are progressively producing both more data and larger datasets. These larger datasets pose challenges to maintaining the sustainability and capacity of the data repositories that researchers are increasingly expected to utilize for openly publishing their datasets. The Texas Data Repository (TDR) is one such repository currently grappling with these challenges and is, in response, working to refine its service model, technical infrastructure, and data retention policy. The Texas Digital Library, which hosts the Texas Data Repository, strives to develop collaborative solutions and relies upon the expertise of its service users to address community needs. Following this approach, the Texas Data Repository Steering Committee's subgroup for Larger Data has developed recommendations for how to scale up support for large datasets while allowing control at the institutional level in our multi-institutional repository. In this presentation, we will share these recommendations, the progress that has been made so far, and our strategy for working within the open source Dataverse community to expand the system beyond our own service needs. The material covered here should be of interest not just to managers of other Dataverse instances, but to all who manage data repositories and rely on them for preserving and publishing large datasets.
Enhancing Open Science, Responsible Research Assessment, and Decision-Making with the OpenAIRE Graph: The Role of Research Librarians
Maja Dolinar (OpenAIRE AMKE, Slovenian Social Science Data Archives)
Giulia Malaguarnera (OpenAIRE AMKE)
Stefania Amodeo (OpenAIRE AMKE)
Harry Dimitropoulos (Athena Research and Innovation Center)
Leonidas Pispiringas (OpenAIRE AMKE)
Ioanna Grypari (Athena Research and Innovation Center)
Ivana Koncic (Ruđer Bošković Institute)
The evolving landscape of scholarly research communication continually necessitates innovative approaches to responsible research assessment and decision-making. At the forefront of this evolution, research librarians are uniquely positioned to empower Open Science initiatives by leveraging the OpenAIRE Graph. This presentation delves into the critical contributions of research librarians in shaping the OpenAIRE Graph as a pivotal resource for monitoring research impact beyond traditional bibliometric indicators. By adopting the OpenAIRE Graph, research libraries can champion the principles of openness and transparency, thereby providing Research Performing and Funding Organisations (RPOs and RFOs) with non-commercial, robust analytical tools. The OpenAIRE Graph, comprising over 242 million research products—of which 81 million are open access—serves as a comprehensive database enriched by contributions from the research community. This resource facilitates identifying metadata schemas, reporting missing information, linking persistent identifiers, and guiding researchers towards trusted resources. The presentation will guide participants through the OpenAIRE ecosystem, highlighting how various services—such as OpenAIRE Provide, OpenCitations, OpenAPC, and OpenAIRE MONITOR—integrate with the Graph to offer extensive insights into research outputs and Open Science activities. These services enable dynamic data visualizations, monitor research impact, and support open access tracking at institutional and national levels. In conclusion, this presentation underscores the significance of research librarians in advancing Open Science policies through the OpenAIRE Graph. By illustrating the collaborative efforts of the community in curating and sustaining open tools, the presentation highlights the strategic role of libraries in navigating and nurturing the future of data services, fostering a culture of transparency and community-driven innovation.
Sharing is Caring (about Research): Addressing Challenges in Sharing Protected Text Data Collections through Non-Consumptive Research
Johannes B. Gruber (VU Amsterdam)
Wouter van Atteveldt (VU Amsterdam)
Computational social scientists increasingly rely on "found" data—such as videos, images, and text—rather than "designed" data generated through traditional experiments and surveys. This shift offers advantages, as found data derived from online behavioral traces can mitigate issues like ecological validity, social desirability bias, and recall bias. However, using found data also presents significant challenges, particularly regarding data ownership and sharing. While collection and analysis of online data often fall under fair use principles, these exceptions typically do not extend to making the data available for follow-up research or reproduction. This creates a fundamental tension between adhering to legal, privacy, and ethical standards and the principles of transparent, reproducible research. In this contribution, we address the legal, ethical, and technical challenges of sharing text data collections by proposing three complementary strategies:
1. Distributing pre-processed text versions that prevent reconstruction of the original content, thereby protecting data owners' interests.
2. Sharing metadata that enables data reconstruction if the original data is still available online.
3. Making non-consumptive research capabilities available that allow comprehensive data analysis without directly consuming (i.e., reading) the text.
Non-consumptive research, pioneered by Google Books, which allows users to search for specific keywords within books without displaying the entire text, can involve simple keyword searches and frequency analyses, but also more sophisticated techniques. The three avenues are not mutually exclusive and can be strategically combined to maximize text dataset sharing within legal and ethical constraints. To demonstrate the practical implementation of these strategies, we have developed software tools that operationalize them.
By anchoring our approach in the dual principles of reproducibility and ethical responsibility, this research bridges the divide between data accessibility and protection. Our work offers a forward-looking pathway for sharing text data that maintains the intellectual integrity of computational social science while respecting privacy and legal considerations.
Do the right thing. Enhancing Research Practices through Data Citation
Christina Bornatici (Swiss Centre of Expertise in the Social Sciences (FORS))
Tuomas J. Alaterä (Finnish Social Science Data Archive (FSD))
André Jernung (Swedish National Data Service (SND))
Lisa Tveit Sandberg (Norwegian Agency for Shared Services in Education and Research (Sikt))
Helena Laaksonen (Finnish Social Science Data Archive)
Data citation is essential for recognising research data as independent scientific outputs, promoting transparency, and ensuring reproducibility. Proper data citation ensures that researchers receive appropriate credit for their work. It also makes data findable and accessible, facilitating the verification and replication of research. Machine-actionable data citation enhances the visibility of research by automatically establishing relationships between publications, authors, and data. However, despite its clear importance, current data citation practices often fall short of supporting openness, FAIRness, and the ethical and impactful use of research data. To address this gap, the CESSDA Key Topic Working Group on Data Citation has developed comprehensive recommendations to foster a sustainable data citation culture. These recommendations identify the core components of a data citation (author(s), title, publication year, version, data publisher, and a persistent identifier) and supplementary elements that can add value. Even more importantly, they provide concrete, advised best practices and technical implementations for various stakeholders, including authors, publishers, data repositories, research performing organisations, policymakers, and ethics committees. These practices highlight the benefits of proper citation and make data citations more effective and impactful. This presentation details these recommendations, their motivations, and the planned dissemination strategy. We also present an analysis of current data citation practices and support among CESSDA service providers, highlighting current strengths and areas for improvement. The results are based on a 2024 survey that informed the service provider community about the recommendations in development. By sharing these insights, we aim to foster dialogue and collaboration and encourage the broader research community to adopt and support improved data citation practices.
Hannele Keckman-Koivuniemi (Finnish Social Science Data Archive / Tampere University)
Matti Heinonen (Finnish Social Science Data Archive / Tampere University)
The Finnish Social Science Data Archive (FSD) tracks data downloads and data use in many ways. We need to know who is using our data and for what purposes. FSD has even created and opened the dataset FSD3424 FSD Data Reuse 2015-2020 on this topic and will soon update it. In this presentation, we share recent statistics on the use of FSD data and discuss the social impact of data reuse. FSD maximises the value and impact of FAIR research data, saves students', teachers', and researchers' time and effort, and enables better research. FSD has been in operation since 1999. We provide access to a wide range of digital research data for learning, teaching, and research purposes. All datasets archived at FSD are available on Data Portal Aila, free of charge. Currently there are more than 2,000 quantitative and qualitative datasets on Aila. FSD services are widely used. Aila has approximately 6,400 registered users, including students, teachers, and researchers from Finnish universities and universities of applied sciences as well as many institutions abroad. Roughly ten percent of all Aila users are from outside Finland. All in all, between 2014 and 2024, approximately 41,700 datasets were downloaded from Aila. Conditions of use apply to most datasets on Aila, and downloading them requires registration. Some 160 datasets on Aila are freely available to all users without registration.
So, you plan to set up a third-party verification and replication service in your institution. It would be good to know what you must consider before embarking on this massive undertaking. In this presentation, I will discuss our experience expanding our verification and replication service in 2024, including hiring and staffing (skills requirements, hours, in-person vs. virtual, HR policies); training (developing training manuals, designing workflows, recording and applying new learnings); tools for communicating with the team and clients; investments in software and hardware; progress tracking; metrics for evaluation, accounting, payroll, and reporting; volume and duration of work; and more.
Questionable Things, Extraordinary Things, and Making Sure the Data Gods Let You Into Heaven: A Case Study of Introducing Graduate Students to Cleaning Data
Robert O'Reilly (Emory University)
The Guardian's 10 Rules of Data Journalism include the following: data journalism is 80% perspiration, 10% great idea, and 10% output (https://www.theguardian.com/news/2014/mar/17/facts-are-sacred-exclusive-extract). While the percentages vary from situation to situation, this rule does capture a simple truth: the grubby work of getting data into shape is often much more time-consuming than the (maybe) glamorous, red-carpet work of analyzing data and presenting results. However, methods classes often focus more on the glamor than the grubbiness. Thesis and dissertation students are often left to their own devices to figure out how to work with data that are much "messier" than the cleaned-up data they work with in classes. How, then, to address this disparity? In this presentation, I will talk about organizing labs on cleaning data as part of a Public Health class from the Spring of 2024 on working with administrative and geospatial data to research drug-related harms and policy interventions. I will provide background on how I got involved in the class, what principles and particulars I tried to convey in the labs, and the replication assignment I created for the students to bring together the material in the labs. The presentation will also discuss how I am updating the labs and material for the class in the upcoming Spring 2025 semester.
Session theme/info: Ethics, Governance, CARE / FAIR
From data discovery to access: a use case of how the FAIR principles have been leveraged to design Quetelet-Progedo
Nicolas Sauger (Progedo | Sciences Po)
Frédérique Gros (Progedo | CNRS)
Ami Saji (Progedo | CNRS)
Quetelet-Progedo (https://data.progedo.fr/) is the data repository dedicated to the social sciences of Progedo (https://www.progedo.fr/), the French national data infrastructure. Its current holdings consist of over 1,650 datasets produced by over 100 data producers and cover a wide range of topics relevant to the social sciences (e.g., demographics, income, employment). Since its conception, Quetelet-Progedo has been designed with the goal of facilitating data discovery and access by leveraging the FAIR principles. The latest version, launched in spring 2024, is demonstrative of this pursuit, as Quetelet-Progedo now offers improved access to pseudonymised data while striking a balance between security and ease of access. This presentation will therefore examine the important role played by the FAIR principles in designing Quetelet-Progedo to arrive at its current state. It will also reflect on how these same principles can be further exploited to drive future enrichments to Quetelet-Progedo that will improve user experience vis-à-vis data discovery and access.
FAIRly Specialist? Generic vs Disciplinary implications for FAIR-enabling Repositories
Hervé L'Hours (UK Data Service)
Oliver Parkes (UK Data Service)
Mari Kleemola (Finnish Social Science Data Archive)
Maaike Verburg (Data Archiving and Networked Services)
The FAIR acronym of Findable, Accessible, Interoperable and Reusable is specified through 15 Principles that range from the specific and testable ('assigned an identifier') to the more general and aspirational ('richly described with a plurality of accurate and relevant attributes'). They are supported by indicators, including those developed through the Research Data Alliance, and a range of emerging tools that specify metrics and apply tests to digital objects, including traditional research data, ontologies, and software. For trustworthy digital repositories, enabling FAIR digital objects has been added to the range of outcomes expected from their retention, curation, and preservation of assets relevant to research. This talk will consider the FAIR Principles from the perspective of the UK Data Service as the UK service provider to the Consortium of European Social Science Data Archives (CESSDA) and its lead partner, the UK Data Archive, as a trustworthy digital repository. It will include insights from the CESSDA Trust & Landscape group, the EOSC Long Term Retention Task Force, the EOSC FAIR Metrics and Digital Objects Task Force, the closing phase of the FAIR IMPACT project, and the initial stages of the EOSC EDEN and FIDELIS projects. We will discuss where FAIR-enabling for generic and disciplinary repositories aligns, where it differs, and how the two can, together, improve the levels of care provided across the digital object ecosystem. From a metrics and testing perspective, we will review the benchmarks set for social science metrics and how they can be supported and improved through community participation and consensus. Furthermore, we will consider the transparency of metadata at the repository and object levels and where significant challenges remain for tools seeking to automate and scale FAIR assessments.
Data Sovereignty, Cultural Heritage and CARE in Service of FAIR
Regina Roberts (Stanford University Libraries)
Come hear from one of the 45 network participants in the 3-year IMLS grant-funded project "Advancing FAIR+CARE Practices in Cultural Heritage". The FAIR+CARE Cultural Heritage Network established by this project is a collaborative structure to support the communication and coordination of FAIR+CARE-related practices across disciplinary, organizational, divisional, and geographic boundaries. The network brings a balance of synergistic strengths from libraries, cultural resource management firms, data repositories and publishers, museums, agency and regulatory representatives, professional societies, academic organizations or projects curating and reusing synthesized data, educators, and Tribal nations. This talk will provide information about the FAIR+CARE network and report on the progress of the project thus far. The first survey conducted produced more questions and highlighted the complexities of advancing digital data governance models that are important for archiving cultural heritage data. Learn about the next phase of this work and how it plans to encourage meaningful and ethical collaborations for archiving Indigenous Knowledge.
Discussion panel: Data are plural, so are data stewards!
Robin Rice (University of Edinburgh)
Sonja Bezjak (Social Science Data Archives, University of Ljubljana)
Irena Vipavc Brvar (Social Science Data Archives, University of Ljubljana)
Naeem Muhammad (KU Leuven)
Pedro Príncipe (University of Minho Documentation and Libraries Services)
For longtime members of IASSIST, the annual conference has been a chance to overcome the isolation of doing data support in siloed institutions. Here, we look to the innovative European model of data stewardship: organised networks of trained stewards across institutions and infrastructures, based in competence centres or distributed by discipline, to guide researchers in making their data FAIR; and importantly, to share expertise and practices amongst data professionals. The geographically diverse, experienced panel is made up of leaders of data stewardship networks in institutions or national initiatives.
Social Science Data Archives, University of Ljubljana, Slovenia – representing the Slovenian "SPOZNAJ" project consortium: delivering training for future data stewards, preparing the Catalogue of competences for data experts, and setting up a Data Stewardship network.
University of Minho, Portugal – sharing the vision and plans at national level within the ReData.pt network, which is establishing the Portuguese RDM support network alongside the RDM capacity hub, including the experience of the informal Portuguese RDM Forum network and the University of Minho RDM and Open Data support services office.
KU Leuven, Belgium – sharing insights from the data stewardship network at KU Leuven and the Flemish Research Data Network Knowledge Hub, a self-learning community of data stewards in Belgium's Flemish region.
Perspective on South Africa's role in science and research by analysing material digitised and collected by the Antarctic Legacy of South Africa
Maria Petronella Olivier (Antarctic Legacy of South Africa)
As a digital archivist, I collect materials from South Africans who have been involved in the Antarctic region. This material covers the history of South Africa's involvement and reveals important insights into the science and research done by South Africans. Diaries, descriptions of images, letters, journal articles, and documents are a huge source of information. Material has been collected for the digital archive since 2009. Digital archiving for the Antarctic Legacy of South Africa (ALSA) repository depends heavily on correct metadata and on the assessment and evaluation of material, and an understanding of South African involvement in the Antarctic region is of utmost importance. In the South African context especially, ALSA has the responsibility to establish the correct dates and historical timeline of involvement. These factors led to an in-depth study of documentation and images to create a timeline reaching back to before the heroic age. The timeline started as a low-impact piece for an exhibition and grew to become part of the digital museum, posters for established museums, and public lectures to specific groups. Initially these timelines focussed on the bases and the vessels. Since 2020, a renewed interest in South Africa's participation and collaboration in research has come forward. As archivist, I have worked through documents already available in the archive to establish the timeline of research within the Antarctic region. In this presentation I shall highlight these timelines and how events in the world and in South Africa influenced them. This research timeline may lead to more research within the social sciences and humanities. ALSA has maintained the repository on DSpace for more than a decade. The contribution of human involvement is more than the pictures and diaries left behind; people's work and lives can enhance our perceptions of the polar environment.
Translating ELSST into Slovenian: Challenges in Supporting Open Science and FAIR Principles
Sergeja Masten (Slovenian Social Science Data Archives)
The European Language Social Science Thesaurus (ELSST) is a broad-based, multilingual thesaurus for the social sciences. It is owned and published by the Consortium of European Social Science Data Archives (CESSDA) and its national Service Providers. The Slovenian Social Science Data Archives (ADP) are a dedicated member of the CESSDA consortium, committed to its mission of promoting open science and cross-national harmonization of archives. Translating the ELSST thesaurus into Slovenian is a critical step toward achieving these goals, as it directly supports researchers and the broader research community by facilitating access to multilingual data. These translations enhance the usability of data archives, making them more accessible and interoperable, in alignment with FAIR principles and the values of open science. When it comes to Slovenian, the language's complex morphology and syntactic variability present unique challenges for developing and maintaining thesaurus translations. Unlike languages with simpler grammatical structures, Slovenian has a rich system of declensions, dual forms, and extensive word compounding. These characteristics complicate the alignment of concepts across languages in ELSST, as terms often need detailed contextual adaptation rather than direct translation. Overcoming these challenges requires collaboration with local linguists and domain experts to ensure translations are both accurate and culturally relevant. Despite these constraints, ADP remains committed to advancing this work as part of its broader effort to foster collaboration, accessibility, and equity in social science research across borders. ELSST's continuous development in languages like Slovenian underscores the importance of linguistic diversity in global social science research. It enables more inclusive access to data and fosters equitable participation in international academic discourse.
Lynda Kellam (University of Pennsylvania Libraries)
In response to the removal of datasets and dismantling of statistical offices at the federal level in the U.S., a coalition of organizations has been working together to communicate and coordinate across data initiatives, provide support for those conducting data rescue events, and develop tools for tracking data rescues. This presentation will discuss the development of the Data Rescue Project as a distinct organization and its roots in the wider public data ecosystem. In addition, it will discuss the paths forward and the possibilities for future data infrastructure collaboration and support in the U.S. and other countries.
Applying Systematic Review Methodologies to Business Studies exploring the Five Safes Framework
Elizabeth Green (University of the West of England)
Son Hoai An Phan (University of Bath)
Abi Ward (University of the West of England)
Systematic reviews are a cornerstone of evidence-based practices, traditionally developed within the medical discipline to ensure high-quality and reliable evidence selection. The methodology underpinning systematic reviews in medicine leverages controlled vocabularies, subject indexing, taxonomies, and standardized protocols to achieve structured and reproducible results. However, when applied to business studies, significant challenges emerge due to the lack of comparable structure, including the absence of controlled vocabularies. This paper investigates these challenges through a two-part study. The first part of the study undertakes a systematic review of research using the Five Safes governance framework, focusing on data management strategies and their efficacy. Employing the PICO framework, the study conceptualizes research as the population, the Five Safes framework as the intervention, and examines outcomes such as data management quality, accessibility, reuse potential, and citation metrics. Comprehensive Boolean search strategies are used across databases such as ASSIA, EMBASE, and EconLit to identify relevant literature, yielding 1,459 articles. Following deduplication and screening, 1,244 articles are analyzed. The second part reflects on the methodological contrasts between conducting systematic reviews in business studies versus health sciences. Key themes include the adaptability of systematic review protocols, challenges in defining search taxonomies for business studies, and the impact of interdisciplinary practices on data governance frameworks. By explicitly exploring projects that employ the Five Safes, this study highlights the governance framework's role in ensuring data accessibility, privacy, and safe reuse. The findings underscore the need for adapting systematic review methodologies to accommodate the less-structured nature of business studies literature. 
Furthermore, this research contributes to the growing dialogue on interdisciplinary data governance and provides practical recommendations for applying rigorous review methods in non-medical fields.
Benefits of Terminologies for Interdisciplinary Research
Claudia Martens (German Climate Computing Center (DKRZ))
Aenne Löhden (German Climate Computing Center (DKRZ))
Markus Stocker (Leibniz Information Centre for Science and Technology University Library (TIB))
Our Earth System is a complex and dynamic network involving interactions between the atmosphere, oceans, landmasses, and biosphere. Being cross-disciplinary at its core, research in Earth System Science comprises divergent domains such as Paleontology, Marine Science, Biodiversity Research, Atmospheric Sciences, Molecular Biology, or Plate Tectonics. Given the exponential growth of data due to technological developments, along with an increased recognition of research data as relevant research output during the last decades, fundamental challenges arise in terms of interoperability, reproducibility, and reuse of scientific information. Within the various disciplines, distinct methods and terms for indexing, cataloguing, describing, and finding scientific data have been developed, resulting in a large number of controlled vocabularies, taxonomies, and thesauri. However, given the semantic heterogeneity across scientific domains (even within the Earth System Sciences), effective utilisation and (re)use of data are impeded, while the importance of enhanced and improved interoperability across research areas will only increase further. The BITS Project (BluePrints for the Integration of Terminology Services in Earth System Sciences) aims to address the inadequate implementation of encoding semantics by establishing a Terminology Service that may serve the whole Earth System Science community at the national, European, and international levels. In our presentation we would like to showcase the benefits of terminologies not only for the Earth System Sciences but also for enhanced interdisciplinary discovery of research output, including the Social Sciences and Humanities. The inclusion of the human factor in Earth System Science is an essential part of future climate model projections, and the extent to which different tools can contribute to this remains the great challenge of our time. We hope to give at least a starting point for that.
Let's chat about data! A study of data discovery with Large Language Models
Anja Perry (GESIS - Leibniz Institute for the Social Sciences)
Christin Kreutz (GESIS - Leibniz Institute for the Social Sciences)
Tanja Friedrich (German Aerospace Center)
The search for reusable data for scientific work has been the subject of intensive research for some time. A fundamental realization from these works is that the search for data differs from the search for literature in various respects. Research data, unlike literature, appear in a very wide variety of file formats and differ in form and content, depending on the field of research from which they originate and the methods and instruments used to generate or collect them. The existing published data cannot be meaningfully indexed in their entirety by any search engine and therefore always require a textual description (metadata or documentation). Studies show that researchers learn about reusable research data indirectly: either from literature in which the data is cited or from exchanges with other researchers, for example in their research group or at conferences. For researchers, searching for data on the web primarily works for data they already know (known-item search). They tend to learn about new data through more intensive engagement with a research topic, either by reading articles or in conversations with other researchers ("data talk"). Large Language Models (LLMs) could help mitigate problems with web searches for new data by providing the opportunity to pose clarifying questions or ask for explanations. In our work, we observe data search behavior and use concurrent think-aloud to capture the thoughts and strategies of participants while they perform search tasks using an LLM. During the second of two data search tasks, we provide each participant with a prompt for the LLM to act as a fellow researcher who is talking with the participant about their data search. In our contribution we will present initial results from our study and derive implications for how data repositories can prepare for increasing data search via LLMs.
Bridging Complex Data and Generative AI for Data Retrieval: Methods, Challenges, and Opportunities
Vassilis Routsis (University College London)
Data services aim to provide accessible and intuitive tools for researchers and policymakers to navigate an increasingly complex and diverse data landscape. Modern interfaces have significantly improved accessibility but often still require considerable technical expertise and domain-specific knowledge, perpetuating barriers to effective data discovery and retrieval. CORDIAL-AI, a pilot project funded under the ESRC Future Data Services programme, explores the potential of Generative AI (GenAI) to address some of these barriers by enabling users to interact with a Large Language Model (LLM) for retrieving complex, bespoke UK census flow data. Flow data is the most complex type of census data, characterised by its substantial size, extensive code lists, large volumes of numerical information, and intricate relational structures. It provides a compelling case study to examine how LLMs can assist users in identifying and extracting tailored subsets of such data while highlighting the significant challenges GenAI faces in handling highly structured datasets and large-scale categorical and numerical variables. The pilot seeks to equip UK Data Service (UKDS) staff with the technical expertise and transferable skills necessary to engage with this rapidly evolving technology, enabling future data services to adapt to new advancements and become more efficient. From a technical perspective, the project builds on a newly developed experimental census API with advanced subsetting capabilities as part of the UKDS's 2024–2030 data-driven strategy. The presentation will detail the methodologies employed to construct reliable pipelines between structured datasets and GenAI systems, leveraging the API alongside advanced techniques in prompt engineering, natural language processing, AI agents, and model fine-tuning. 
It will reflect on practical insights from applying advanced GenAI techniques to data retrieval, offering perspectives on how such approaches can shape the development of innovative tools for data access.
Increasing Data Literacy Instruction through Improved Assessment and Outreach: A Case Study
Whitney Kramer (Cornell University Library)
Academic librarians understand the importance of critical data literacy skills for undergraduate students, but it can be a challenge to convince faculty members that it is necessary for their students to cultivate these skills in the classroom environment. Due to recent curricular changes at Cornell University's School of Industrial and Labor Relations (ILR), the staff at Catherwood Library have seen a significant increase in the number of data literacy-focused reference questions and consultations. By analyzing existing reference and instruction-related assessment data, we identified specific economics and statistics classes where reference interactions spiked after the new curriculum was introduced, indicating that the students would benefit from increased data literacy instruction in this area. However, faculty in these departments rarely took advantage of existing library instruction offerings, and it took time and effort to get them on board with integrating these much-needed in-class sessions into their existing instructional plans. This presentation will focus on the efforts at Catherwood Library to increase the number of data literacy-focused one-shot sessions in economics and statistics classes as part of our reference and liaison program. Key successes and challenges will be discussed, as well as data analysis and assessment strategies, future plans for expanding this work, and additional suggestions for librarians who are interested in implementing data-driven reference, instruction, and outreach at their own institutions.
Data for Everyone: A Collaborative, Skill-Oriented Model for Engaging the Community
Heather Charlotte Owen (University of Rochester)
Sarah Siddiqui (University of Rochester)
The motto for Data Services at the University of Rochester's Libraries is that "Data is for everyone". To facilitate this, staff offer a variety of support to engage the entire university community. Instead of subscribing to a subject-specific model, members of Data Services cover the entire research data lifecycle by specializing in different skills. First, a Data Librarian reviews data management and sharing plans, offers guidance on data management and sharing best practices and federal funder requirements, and leads data literacy initiatives. A Reproducibility Librarian supports the next section of the lifecycle, helping researchers process, analyze, and share work openly and reproducibly through electronic laboratory notebooks, coding, data visualization, APIs, etc. Finally, a Data Curator closes the lifecycle by administering the institutional repository and assisting researchers with curation and preservation. Our service, by being skill-oriented and discipline-agnostic, allows us to support all disciplines. It is our marketing strategy, however, that ensures we reach the full community. While we utilize the usual gamut of marketing strategies such as newsletters, flyers, and listserv emails, the key to our outreach is offering public educational opportunities. From an annual data visualization contest to a wide array of workshops on both discipline-specific and discipline-agnostic skills, we promote awareness of our service among researchers across the university. An increase in consultations and tickets often follows an event, and we can launch a new service by hosting a related workshop. We also employ students to reach out to more departments and clubs across the university. In this presentation, we will discuss the unique strategies that shape our offerings to reinforce that the Libraries' Data Services serve everyone at the university.
We will further reflect on the strengths and challenges over the first year and a half of the model and explore next steps to better serve our community.
Advancing Data Access and Training: The Impact of the Data Liberation Initiative in Canadian Universities and Colleges
Elizabeth Hill (Western University)
Alexandra Cooper (Queen's University)
Siobhan Hanratty (University of New Brunswick)
Since 1996, the Data Liberation Initiative (DLI) has had a significant impact on how data has supported teaching, research, and publishing in Canada. A partnership between post-secondary institutions and Statistics Canada, an early goal of the DLI was to reduce financial barriers to using Canadian data, but by its very nature, the program also developed a robust communications network and training workshops to respond to DLI Community members' needs. The program has recently undergone a strategic review which found that, at its core, the value and strength of the DLI is in the training and knowledge exchange of its member community. Alexandra Cooper, Siobhan Hanratty and Elizabeth Hill have supported the Data Liberation Initiative at their institutions since 2003, 2001 and 1998 respectively, and have held leadership roles within the DLI Professional Development Committee since 2018, 2009 and 2010. Revisiting an IASSIST presentation from 2018, the presenters will share a deeper examination of this program and its importance to the academic community. They will explore how data services in Canadian post-secondary institutions have changed over time, how the topics of training at regional and national DLI workshops reflect these changes within the broader data and research community, and report on a new research project that will utilize survey results of librarians and specialists from the DLI community, literature reviews, an analysis of existing training materials, and the DLI strategic review.
June 5, 2025: Concurrent Session F. Panel
It Takes a Campus: building cross-campus collaborations to support research computing and data needs
Moira Downey (North Carolina State University)
Susan Ivey (North Carolina State University)
Erin Foster (University of California, Berkeley)
Wind Cowles (Princeton University)
Sophia Lafferty-Hess (Duke University)
As research generates greater volumes of data, the data management and storage solutions historically relied upon by researchers can prove inadequate. Simultaneously, the infrastructure required to meet the demands of data management at scale (storage, transfer, analysis, visualization) often involves skill sets outside those held by many traditional data services professionals. Given finite bandwidth for staff retraining, one strategy that institutions are employing to address these accelerating needs is to foster connections and collaboration with the campus information technology and research computing professionals charged with building and maintaining this infrastructure. While institutions explore these partnerships, there has been a concurrent proliferation of organizations focused on supporting the development and professionalization of the Research Computing and Data (RCD) field. Groups like the Campus Research Computing Consortium (CaRCC) and the EDUCAUSE Research Computing and Data Community Group take a broad view of RCD, addressing topics ranging from containerization and the cloud, to data management and movement, to training and education for users, including deep and sustained engagement to help researchers achieve their goals ("research facilitation"). Originating in the operational needs of high performance computing, these communities have gradually expanded their reach into libraries, acknowledging the interdependent nature of support for research computing and data management and the unique expertise that data services professionals contribute. This panel will bring together in conversation four institutions that have drawn inspiration from groups like CaRCC to develop cross-campus models designed to meet changing RCD needs.
The University of California, Berkeley, North Carolina State University, and Princeton have developed services that leverage a range of RCD skills to help researchers navigate increasingly complicated technological and regulatory landscapes, and will share their experiences with service development, challenges, opportunities, and lessons learned. The conversation will be moderated by Duke University, which has recently begun its own RCD initiative.
Balancing FAIR Data Principles with GDPR Compliance: A Guide for Researchers
Ina Nepstad (Sikt - Norwegian Agency for Shared Services in Education and Research)
Irena Vipavc Brvar (ADP - Social Science Data Archive)
Lisa Tveit Sandberg (Sikt - Norwegian Agency for Shared Services in Education and Research)
Simon Gogle (Sikt - Norwegian Agency for Shared Services in Education and Research)
All research data should be FAIR—Findable, Accessible, Interoperable, and Reusable—and as open as possible. However, much research data includes personal data, which is subject to GDPR (General Data Protection Regulation). Many researchers are concerned that GDPR may conflict with the FAIR principles, especially regarding the sharing, accessibility, and reusability of personal data. This presentation demonstrates how personal data can be both legally compliant and FAIR when GDPR is correctly applied. We will explore key aspects of GDPR that are particularly relevant to research, including exceptions that allow for data storage and archiving beyond standard retention periods. The presentation will also cover potential legal bases for processing personal data in research. To ensure that research data is both FAIR and GDPR-compliant, researchers are encouraged to conduct a GDPR assessment early in their projects, select the appropriate legal basis for processing data, and maintain transparency with participants about how their data will be used. This session will address common challenges and offer practical advice on planning and documenting research processes to ensure data is accessible, protected, and ready for sharing and reuse in compliance with relevant regulations. By understanding the flexibilities offered by the GDPR, researchers can ensure that their data meets both legal requirements and the FAIR principles. This presentation will explore how to balance data accessibility with privacy concerns while ensuring compliance with both FAIR and GDPR frameworks. Attendees will leave with actionable strategies for aligning research practices with these standards, ensuring data is legally compliant and reusable.
Attitudes toward Data Management and Repository-based Approaches to Sharing Sensitive Data among Researchers from Different Disciplines
Dessi Kirilova (Qualitative Data Repository)
Derek Robey (Qualitative Data Repository)
Sebastian Karcher (Qualitative Data Repository)
The broader project seeks to understand the challenges researchers face when they seek to share data they collect from human participants while also fulfilling their ethical obligations to those participants and following all relevant privacy and confidentiality regulations. Meeting these goals can be facilitated by the use of a variety of privacy-preserving tools (PPTs) and resources. In this presentation, we examine researchers' perceptions of data sensitivity in different research scenarios. Specifically, we will share quantitative results from an online survey of 200 purposefully recruited respondents from across 12 social science and biomedical disciplines, asking about their familiarity with and propensity to use resources and techniques that might enable better data management, as well as appropriate sharing of sensitive data (e.g., data use agreements, virtual enclaves). Additionally, we discuss respondents' reports on their actual experiences with planning for data sharing in the context of a concrete research project, including whether they actually shared the data, what barriers they encountered, and what information resources they used in the process. We will also discuss common themes in respondents' free-text answers to a question about how they personally define research data sensitivity. We highlight common challenges researchers perceive in planning for the sharing of what they consider to be sensitive data, which is of relevance to the work of research data librarians. Relatedly, the presentation also highlights possible avenues for both general assistance and project-tailored guidance, which librarians are well-poised to offer.
The overarching goal of the project is to inform next steps in the development of technological and operational approaches that a variety of stakeholders in the academic research data landscape (digital repositories; academic journals; funders; as well as data-related library and IT services and IRBs at universities) can offer to researchers for their data-related research needs.
Archiving and Publishing Digital Behavioral Data
Oliver Watteler (GESIS - Leibniz Institute for the Social Sciences)
Katrin Weller (GESIS - Leibniz Institute for the Social Sciences)
Jan Schwalbach (GESIS - Leibniz Institute for the Social Sciences)
Since 2013, GESIS has been actively involved in and has managed projects covering 'digital behavioral data' (DBD). This concept encompasses digital observations of human and algorithmic behavior which are recorded, among other means, by online platforms like Facebook or the World Wide Web, or by sensors such as smartphones or RFID sensors. It focuses on the societally relevant aspects of human and algorithmic behaviour and the research perspectives derived from this. Since 2022, GESIS has strategically shifted its focus more towards digital behavioral data and, based on a special purpose grant, was also able to intensify its engagement in this area. The methodical research on the collection, analysis, and quality of DBD at GESIS is accompanied by the development of new services. This presentation gives insights into the technical, organizational, and legal challenges that had to be overcome and the solutions GESIS has developed so far for archiving and publishing this type of data collected by internal and external projects. It covers the legal bases GESIS considers, the workflow to ingest new digital behavioral data, the handling of large data quantities, the expansion of existing metadata schemas, as well as access to the data on-site and (prospectively) off-site. While some aspects are still under development, we present the current state, focusing on two data types as pilots: data collected from online platforms (social media), and data collected via web-tracking browser plugins from study participants who donated their web browsing histories.
Session theme/info: Ethics, Governance, CARE / FAIR
Data Documentation Initiative (DDI) and Training Working Group: Who, What, and How
Kathryn Lavender (Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan)
Catherine Yuen (Institute for Social and Economic Research, University of Essex)
Chantal Vaillancourt (Statistics Canada)
This presentation introduces the Data Documentation Initiative (DDI) and the training materials that have been developed by the DDI Alliance Training Working Group (TWG). The DDI training materials assist a variety of audiences and experience levels in understanding DDI products and implementing them in their organizations. The first version of DDI was published nearly 25 years ago, and since then many dedicated DDI enthusiasts have worked together to promote understanding of data, metadata, and standards through the use of DDI. DDI is an open international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. It can also be used to document and manage different stages in the research data lifecycle. The DDI Training Working Group maintains a training library using Zenodo and YouTube, with some content available in multiple languages, such as French and Korean. These resources are freely available to serve individuals both new to and familiar with DDI. This presentation will provide some history of DDI and the DDI TWG, and guide attendees to existing and recently developed DDI resources.
Developing and testing harmonisation workflows for comparative survey data using DDI – a WorldFAIR case study
Steven McEachern (UK Data Service, University of Essex)
Hilde Orten (Sikt)
Ryan Perry (Australian Data Archive)
Kristina Strand (Sikt)
DDI Lifecycle and DDI-CDI provide significant capabilities for the integration and harmonisation of content across datasets. As part of the recently completed WorldFAIR project led by CODATA, a team from the Australian Data Archive (ADA) and Sikt led a work package examining ways to improve FAIR practices in the management of harmonised content in cross-national social surveys. This work was completed in three stages: a review of comparative survey data management practices at Sikt and ADA; development of a human- and machine-actionable workflow for the harmonisation of social surveys (the Cross-Cultural Survey Harmonisation workflow, CCSH) that leverages DDI and other standards; and a proof-of-concept test of the CCSH workflows leveraging services available at ADA and Sikt through their respective Colectica registries. Overall, the pilot demonstrated that the CCSH workflow forms a viable foundation for standardising and progressively automating the process of survey data harmonisation. However, the pilot also showed that a significant degree of manual human input is still required, and thus more work is needed before the workflow is truly FAIR. We therefore provide recommendations for data managers and the DDI Alliance as to how greater integration and automation might be achieved in future.
Creating a Custom Concept System to Document Longitudinal Studies
Jennifer Zeiger (ICPSR)
The National Archive of Computerized Data on Aging (NACDA) began working with DDI-Lifecycle in 2018. Since then, NACDA has made efforts to document some of our most established and frequently used longitudinal data collections in DDI-Lifecycle and display them on a Colectica Portal. In this presentation, I will describe NACDA's efforts to develop a standardized taxonomy of topics by which to organize conceptual variables in the cross-wave, cross-series, and multi-series comparisons available on the NACDA-Colectica portal. The discussion will include how organizing conceptual variables in these data collections into topical groups improves the interoperability and findability of the metadata on our portal, how the system has evolved over time, and our plans for its improvement and use in the future.
Transparent decision-making in research data governance: The Leeds approach
Sally Dalton (University of Leeds)
Helen Blomfield (University of Leeds)
Effective data governance is critical for ensuring responsible management and sharing of research data. At the University of Leeds Library, we conducted a review of data governance practices for deposit and access requests to our Research Data Repository, aiming to establish a robust and transparent framework for decision-making. Our goal was to collaborate with key stakeholders across the University, drawing on expertise from Research Ethics, Information Governance, Legal Services, and the Library, to develop a governance framework that supports complex decisions around FAIR data sharing. As part of this process, we created guiding principles to help researchers navigate key considerations when requesting to deposit data, particularly in cases where participants may not have been fully informed about the future use of their anonymised data, due to contradictory or unclear consent wording. A significant outcome of this work was the establishment of the Data Access and Retention Group (DARG), a multidisciplinary sub-group of the Information Governance Oversight Group. DARG reviews data deposit and access requests for datasets with higher sensitivities, ensuring decisions are informed by appropriate expertise and governance. To support this, we implemented a three-level access model: open, restricted, and controlled. Access requests for controlled-level datasets, which often involve higher sensitivities, are escalated to DARG for formal review and oversight. This presentation will outline the steps taken to improve data governance, highlight challenges encountered during the process, and evaluate the impact of the new model on promoting responsible and transparent data sharing.
Data governance training for low- and middle-income countries
Felix Ritchie (UWE Bristol)
Pedro Ferrer Breda (UWE Bristol)
Subash Gajurel (Kathmandu Medical College)
Elizabeth Green (UWE Bristol)
Dio Kordopati (UWE Bristol)
Much social science research is based around personal data, so understanding data governance is key to collecting and using that information safely. This is a problem, as research methods courses typically focus on a few elements (ethical approval, data protection law) and skip over or ignore issues such as disclosure control, data management, or staff training. The problem is more severe in low- and middle-income countries (LMICs), where (a) resources are more limited and (b) there may be fewer off-the-peg resources (template consent forms, common wording for data sharing agreements, etc.) to call on. In 2020 the UK National Institute for Health and Care Research (NIHR) funded, as a proof of concept, a 'summer school' for researchers on its Global Health Programme. This built upon an earlier NIHR-funded exercise in training researchers in Nepal. The summer school was well received and generated a wealth of new ideas and approaches for the teaching team. NIHR agreed to fund a further series of virtual courses, running in Spring and Autumn 2022-2026. As well as delivering the course, the team were asked to develop teaching materials which could be adapted and used by others. This paper describes the lessons learned from running the course over the first two years. The learning comes from three sources:
- developing materials to cover cradle-to-grave data governance and management, from data collection to post-project distribution strategies;
- applying this to radically different situations in a consistent manner to allow effective learning for all;
- identifying common issues for researchers in LMICs, such as data colonisation.
This research was funded by the NIHR (NIHR150089) using UK aid from the UK Government. The views expressed in this presentation are those of the authors and not necessarily those of the NIHR or the UK Department of Health and Social Care.
Making Data Governance an ongoing activity: a case study from Nepal
Subash Gajurel
Felix Ritchie
Dionysia Kordopati
Sunil Kumar Joshi
Julie Mytton
Data governance (DG) plays a vital role in the research process, enhancing accountability, data quality, and risk management to ensure sustainable research outcomes, especially in resource-limited settings like low- and middle-income countries (LMICs). However, DG is often seen as a one-time planning task rather than an ongoing process, and its effectiveness is rarely considered. This paper reviews how DG was developed for a large research programme (SafeTrip Nepal), focusing on how it evolved into an ongoing review process and how we set about formally evaluating it. SafeTrip (2022-2026) is funded by the UK National Institute for Health and Care Research to improve road safety in Nepal. The funding explicitly included the development of DG capability in the Nepalese partners. DG initially focused on in-person training and developing DG plans for each of the four workstreams within SafeTrip. Following the initial planning stage, the DG team began to explore how DG could be embedded into operations more actively. Accordingly, the team introduced a schedule of ongoing light-touch review, spread across four key stages: planning, data collection, analysis, and post-project activities. This proved effective at highlighting inconsistencies between plans and outcomes early enough to implement mitigation measures. In Autumn 2024 the team evaluated the effectiveness of DG more formally. A survey and one-on-one interviews were conducted across the SafeTrip team. A mixed-methods approach was employed to assess the plan's impact on risk management, ethical compliance, and overall project support. Overall, the DG plans effectively mitigated risks, ensured ethical compliance, and maintained data security. The "Five Safes" framework enhanced accountability, data quality, and participant trust; flow diagrams facilitated data management.
Challenges, such as the initial technical issues with data collection tools and the need for software adaptation, were resolved through flexible platform adjustments and collaborative problem solving. Respondents highlighted the need for regular training.
ASSURED - ensuring safe research by safe people
Deb Wiltshire (GESIS-Leibniz Institute for the Social Sciences)
Simon Parker (GHGA)
Vanessa Gonzalez Ribao (GHGA)
Wiebke Weber (BERD@NFDI)
Markus Herklotz (BERD@NFDI)
Germany has a large Research Data Centre infrastructure that facilitates the analysis of complex, highly sensitive data. This infrastructure is being further developed to expand the range of research communities served. Alongside this growing infrastructure is a need to foster awareness about the importance of using sensitive, potentially disclosive data responsibly. Researchers and data professionals must acquire special skills to handle these data in an ethical and efficient way. A user survey found that over 60% of respondents are interested in learning more about data access and sharing. Many institutions offer data protection training, but this is typically generalist in nature and does not cover key topics such as understanding data access and how to produce non-disclosive research outputs. Research Data Centres may offer more specialist training for researchers accessing their data, but this is not recognised across services, so researchers accessing data from multiple infrastructures must train multiple times. To address this, we are developing ASSURED: an adaptable e-learning training and accreditation system that promotes best practices for safe data use and sharing, further fostering career development within the research data sector. ASSURED is currently focused on researchers and data professionals in Germany but has the potential to be expanded across Europe in alignment with the objectives of the European Open Science Cloud. ASSURED provides core modules that cover the key knowledge needed to work with sensitive data safely. Additional modules provide specific skills for working in a particular secure environment, with different data types such as genetic data, or in specific roles such as researcher or data access professional. The short, compact modules are supplemented with activities and assessments.
By integrating the training with an Authentication and Authorisation Infrastructure, those trained using ASSURED will be able to easily demonstrate their accreditation and have it recognised at multiple Research Data Centres.
Highlighting Gender Bias and Misaligned Expectations in Job Advertisements for Data Professionals
Elizabeth Green (University of the West of England)
Hilary Lowe (University of the West of England)
The Future Data Services project, "Optimizing Data Professional Success: Identifying Skills, Career Trajectories, and Training Requirements for Enhanced Data Service Delivery," aims to improve training and development for data service staff through skill mapping, curriculum design, and equitable recruitment strategies. This paper explores implicit gender bias in job advertisements for data professionals, applying Gaucher, Friesen, and Kay's (2011) lexical analysis framework to identify gendered language. Building on the Future Data Services Report (Green, 2024), which analyzed 315 UK job advertisements, the project identified key misalignments between advertised roles and job realities. Person specifications often emphasized technical skills, such as research and data expertise, while interviews with professionals highlighted the importance of soft skills like customer service. Many job descriptions appeared to derive from boilerplate templates, especially in academic institutions, causing mismatches between expectations and actual responsibilities. This research is contextualized within prior studies, including Thielen and Neeser (2020), which documented the shift toward hiring data professionals outside traditional librarian pipelines in U.S. academic libraries. Additionally, Hu et al. (2024) demonstrated how job advertisement language can reinforce or disrupt gender and racial inequalities in the labour force. The paper provides a comprehensive analysis of gendered language in job advertisements for data professionals, identifies systemic biases, and offers recommendations for aligning job advertisements with equitable career opportunities in data services.
India's Data Contributions in Generalist Research Data Repositories: Challenges and Opportunities
Pallab Pradhan (Information and Library Network (INFLIBNET) Centre, Gandhinagar)
Lavji N Zala (Sardar Patel University, Vallabh Vidyanagar, Gujarat)
The paper briefly outlines India's policy landscape on research data management and open data sharing. Most scientific research, advancements, and innovations happening across disciplines in Indian academic and research organizations are funded by the Government of India under its various ministries, departments, and autonomous agencies, bodies, councils, and boards. However, although policies exist, there is still a lack of proper organizational infrastructure aligned with those policies to support open data sharing. Further, the paper provides a quantitative analysis of India's contribution to generalist data repositories in terms of the datasets deposited by Indian academics and researchers, under which key disciplines, and by which organizations. The study covers the seven data repositories under the Generalist Repository Ecosystem Initiative (GREI), i.e. Dataverse, Dryad, Figshare, Mendeley Data, Open Science Framework (OSF), Vivli, and Zenodo, as well as Open Data Bank and IEEE DataPort. The number of datasets deposited from India by Indian contributors was extracted by searching and filtering from those portals' search interfaces. It was found that not all of those repositories offer proper search and retrieval options for extracting country-wise data directly. Thus, other means, such as the use of APIs and direct data requests to the repositories' administrators, were employed by the researchers. Furthermore, through this study, the researchers argue that institutional and infrastructural challenges, the lack of widespread awareness and adoption of data management policies, limited funding support for data sharing initiatives, gaps in technical know-how for data curation, limited understanding of proper metadata and data formats, and the absence of mandatory data sharing requirements in research grants from funding agencies may be some of the main reasons.
The study suggests that updated institutional policies, the framing of national policies to support open data sharing, and incentives for data sharing, together with adequate awareness-raising and capacity building in data management, are the key needs and opportunities.
The Crash of Sound Waves: Data Engagement Through Sonification
Will Clary (University of Michigan)
This presentation explores the promise of data sonification to enhance engagement and accessibility in data analysis. Sonification represents an engaging and potentially transformative approach to understanding complex datasets, particularly those that are multivariate, time-varying, or logarithmic in nature. By converting data into sound, we facilitate global pattern recognition and enable users to uncover intricate relationships and trends that remain hidden in traditional visualizations. Sonification introduces elements of creativity and artistry, thus humanizing analytical processes and pushing users to "feel" the data. It is a powerful complement to visual displays and adds a multisensory dimension that enriches data exploration. Dynamic auditory feedback paired with interactive graphs provides further understanding by combining sight and sound. Additionally, sonification plays a crucial role in promoting inclusivity, particularly for visually impaired individuals. Innovations like "sound-graphs" transform visual data into auditory representations, making inquiry more accessible. Through three original examples of sonifications, this presentation explores the sublime sonic space between science and art while introducing easily accessible tools that aid data professionals in developing alternative ways to access information, fostering deeper data engagement, and encouraging insight through innovative perceptualization techniques.
Using the UK Census Longitudinal Studies: Census linkage through to opportunities for Longitudinal Research and Comparative Analysis
Lee Williamson (University of Edinburgh)
Lynne Adair (University of Edinburgh)
Stephen Jivraj (UCL)
This session will introduce the 3 UK Census Longitudinal Studies (LSs) and the 3 user research support units (RSUs): the Centre for Longitudinal Study (LS) Information and User Support (CeLSIUS), the Northern Ireland LS (NILS) RSU (NILS-RSU), and the Scottish LS (SLS) Development Support Unit (SLS-DSU). It will discuss their size and scope, and the recent linkage of the 2021/2 Census data. Arrangements for accessing the data from the 3 RSUs and some key areas for research will also be highlighted. Each of the 3 LSs takes a sample of the population from Census data and follows it across time, linking in administrative data, with capacity to link further data at low-level geographies. The LSs are not based on voluntary surveys; they provide unparalleled coverage and sample sizes which allow research using risk factors and outcomes often unavailable from other sources. The ONS LS has 50 years of follow-up (1971-2021); it follows a 1% sample of the England & Wales population linked to births, deaths, and cancer registration data. The SLS has 30 years of follow-up (1991-2022), and linkages include health, education, and environmental data, as well as births, deaths, and marriages; it covers a 5% sample of the Scottish population. The NILS covers 28% of the Northern Ireland population, with 40 years of follow-up (1981-2021) and linkages to datasets such as health (including prescribing data), births, deaths, marriages, and property data. The session will showcase the opportunities for longitudinal research and comparative analysis. It will discuss the similarities and differences between the studies and also highlight the 2021/2 Census linkage. This linkage provides researchers with opportunities to examine a range of new topics, owing to newly introduced questions, and to examine socio-economic and demographic changes taking place since the 2011 Census, a 10-year period that has seen Brexit and the Covid-19 pandemic.
June 6, 2025: Concurrent Session H.1
Integrating Open Research Data into Research Assessment: Insights and Challenges
Pedro Araujo (FORS)
In recent years, calls for reform in research assessment have intensified, with international initiatives such as DORA and CoARA. These efforts advocate moving beyond mainstream metrics, particularly the controversial over-reliance on bibliometrics, towards innovative evaluation approaches that recognize diverse research outputs. This shift aligns closely with the Open Science movement, where practices like Open Research Data (ORD) are receiving growing attention. Increasingly, FAIR data production and sharing are expected by universities and funding bodies, prompting institutions to include ORD in the assessment of scientific production. To support researchers in this transition, higher education institutions (HEIs) are developing infrastructure, training, and services in Research Data Management (RDM) to facilitate the integration of ORD into everyday academic practices. However, integrating ORD into research assessment frameworks remains complex and raises concerns, particularly in the social sciences, where the diversity of data types and their sensitive nature often complicate data sharing. What are the main challenges HEIs face in embedding ORD into evaluation processes? How can these challenges be addressed to ensure effective and equitable research assessments? Experts in RDM play a crucial role in tackling these issues by identifying challenges to the formal recognition and assessment of ORD. This communication draws on findings from a survey conducted across 30 Swiss higher education institutions, targeting individuals involved in research assessment and ORD policy implementation. While most institutions have some strategies in place, dedicated funding and comprehensive assessment mechanisms remain scarce. Moreover, the analysis identifies four types of barriers (financial, technical, social, and epistemic) that impede the recognition of ORD as a legitimate research output.
Based on these findings, we will offer a series of recommendations to address these challenges and contribute to the ongoing discussion on adapting the assessment of ORD to the specificities of the social sciences.
Preserving Today's Public Geospatial Data for Future Researchers
Karen Majewicz (University of Minnesota)
What happens to public geospatial data when it isn't systematically archived? How do nations ensure that today's datasets remain accessible as part of the historical record? Can we balance the urgency of providing real-time data with the responsibility to preserve it for future researchers? To explore these questions, this paper compares how countries manage and preserve geospatial data over time, focusing on two primary governance models: decentralized and centralized. Decentralized systems, such as those in the United States, rely on local governments to manage data. This approach often leads to inconsistent access and preservation across regions. Centralized systems, like the European Union's INSPIRE Directive, emphasize standardized data-sharing practices across all levels of government. While this approach achieves consistency, it may not satisfy diverse local needs. Hybrid models that combine elements of both approaches are also examined. This paper introduces a comparative framework to evaluate key aspects of spatial data infrastructures, including metadata standards, open versus restricted data, temporal coverage, retention policies, geoportal technology, and governance structures. The analysis identifies where these systems excel and where they fall short in addressing data preservation. Libraries, with their expertise in curation and long-term stewardship, can play a role in bridging these gaps by supporting intentional archiving practices. The paper concludes that such practices will ensure that today's geographic information remains accessible and valuable for future use. This session invites attendees to reflect on their own national systems and contribute to a global dialogue on building better strategies for managing and preserving geospatial data for generations to come.
Pragmatic Interoperability - Consistent Usage in Metadata Standards
Dan Gillman (US Bureau of Labor Statistics)
The term interoperability was first applied to the parts needed to maintain military equipment during WWII: the same axles, for instance, could be used in jeeps, personnel trucks, etc. For IT systems, the term addresses the ability to use a resource without outside help. This applies to systems, structures, representations, and meanings; combinations of these imply data can be interoperable. With the adoption and use of metadata standards such as those in the DDI family, another form of interoperability is needed. Many of the DDI specs are complex, and in some situations the same descriptive problem can be handled in several different ways using these standards. A second problem is the multiplicity of uses of some aspects of the standards. For humans reading the content in an application, these differences are surmountable, even if annoying. Machines don't have that flexibility, and with the current emphasis on machine-actionable metadata, the issue needs to be thoroughly understood. If we think of the specification, the individual elements, and their content as the constituents of declarative sentences, what is being said by this combination is different in each case. This can be characterized as pragmatics; thus we are interested in pragmatic interoperability. When the same content is managed through different parts of a specification, different sentences result. In this talk, we define what is meant by pragmatic interoperability, provide examples of it, and propose guidelines for how to avoid the problem.
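The problem the abstract describes can be sketched in a few lines. The element names below are illustrative, not actual DDI markup: the same fact (a variable's label) is captured through two structurally different, equally plausible encodings, and a consumer built for one encoding silently misses the other.

```python
# Illustrative sketch (hypothetical element names, not real DDI): the same
# "declarative sentence" -- this variable's label is "Age in years" --
# expressed through two different parts of a specification.
import xml.etree.ElementTree as ET

encoding_a = "<Variable name='age'><Label>Age in years</Label></Variable>"
encoding_b = ("<Variable name='age'>"
              "<Description type='label'>Age in years</Description>"
              "</Variable>")

def label_of(xml_text):
    """Extract the label only if it uses the element this consumer expects."""
    root = ET.fromstring(xml_text)
    node = root.find("Label")
    return node.text if node is not None else None

print(label_of(encoding_a))  # a human and this parser both read the label
print(label_of(encoding_b))  # None -- same meaning, different "sentence"
```

A human reader sees both snippets as equivalent; the machine consumer does not, which is exactly why pragmatic interoperability matters for machine-actionable metadata.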
June 6, 2025: Concurrent Session H.2
Synthetic data fidelity: how less can be more
Jools Kasmire (UK Data Service / University of Manchester)
Synthetic data is generated rather than observed, and includes everything from values someone makes up on the spot and random numbers generated by simplistic code to predictions made by complex machine learning models, the output of sophisticated digital twin simulations, and much more. Fidelity, an important related concept, captures how "faithful" a synthetic data set is to a real-world counterpart. As such, fidelity is often seen as an important, if not the most important, feature of synthetic data. Yet fidelity is not binary; a synthetic data set can be very faithful in some ways while wildly unfaithful in others, with the specifics of its fidelity determining its usefulness. For example, if synthetic data is intended to fix gaps or biases in real-world data sets, then it must be deliberately unfaithful to the original in at least some specific ways. At the same time, not all synthetic data sets try to mimic, replicate, or augment existing real-world data sets, and some may not use any real-world data in the generation process at all. As such, fidelity (and especially high fidelity) is not always as important as might be assumed. This talk introduces and defines what synthetic data is and is not, then examines the role of fidelity, highlighting common use cases, generation methods, and concerns around synthetic data at varying levels of fidelity. When used appropriately to link a method, data set, and research question, synthetic data can provide a valuable alternative to real-world data in situations where real-world data is unavailable, restricted, or unknown. Importantly, synthetic data is especially useful for enhancing reproducibility and transparency in research by balancing data utility against privacy protection, as well as for facilitating hypothesis testing and method development.
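The "faithful in some ways, unfaithful in others" point can be made concrete with a minimal sketch (illustrative only, not a method from the presentation): resampling each column of a data set independently preserves every marginal distribution exactly while deliberately destroying the joint structure, such as a potentially disclosive correlation.

```python
# Minimal fidelity sketch: high marginal fidelity, low joint fidelity.
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: two strongly correlated variables (5000 records).
x = rng.normal(50, 10, 5000)
real = np.column_stack([x, 0.9 * x + rng.normal(0, 3, 5000)])

# Synthetic data: shuffle each column independently. The exact same values
# survive, so means, variances, and histograms match the real data; the
# correlation between the columns does not.
synthetic = np.column_stack([rng.permutation(real[:, 0]),
                             rng.permutation(real[:, 1])])

print(np.corrcoef(real[:, 0], real[:, 1])[0, 1])          # strong (near 0.95)
print(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1])  # near zero
print(real.mean(axis=0), synthetic.mean(axis=0))          # identical marginal means
```

Whether this synthetic set is "good" depends entirely on the research question: it is useless for studying the relationship between the two variables, but may be ideal for testing code or protecting privacy.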
Data on the Margins: Topics and Concepts in LGBTIQ+ Data
Jonas Recker (GESIS - Leibniz Institute for the Social Sciences)
Anja Perry (GESIS - Leibniz Institute for the Social Sciences)
LGBTIQ+ people are considered a 'hidden population' by demographers. Data about this population is missing not only due to a lack of awareness on behalf of those collecting data, but also due to political resistance, laws that criminalize and persecute non-cisgender and non-heterosexual people, and stigma and discrimination (Colaço & Watson-Grant, 2021). Data gaps such as this one exist due to unequal power relations (D'Ignazio & Klein, 2020). They both perpetuate and result in a dominance of male, white, hetero, and cis perspectives in how we make sense of and interact with the world. A first step towards identifying and closing data gaps is to take stock of data that already exists. In our project "Data on the Margins", we identified all LGBTIQ+ datasets held in European Social Science Data Archives. In August 2023, 66 such datasets were held by 8 of the 34 CESSDA member and partner archives (https://doi.org/10.7802/2650). By mapping all applied keywords to the CESSDA Topic Classification, we were able to determine which topics were strongly covered in the data, and which received little or no coverage. We found that while many studies were assigned keywords from the CESSDA topic classes 'Health', 'Social Groupings and Stratification', and 'Society and Culture', within these classes topics such as reproductive and mental health, the elderly, youth, ethnic groups, migration, and disability were hardly covered. In addition, keywords were frequently paired with negatively connoted terms such as 'discrimination' or 'bullying', suggesting damage- or deficit-centered approaches. To corroborate these findings, we have begun analyzing study documentation and instruments, e.g. questionnaires, for topics covered and concepts employed. We will present initial findings of this step along with the results of the keyword analysis.
The Data Aren't Alright Or: How I Learned to Stop Worrying and Love the Archives
Sandra Sawchuk (Mount Saint Vincent University)
The settlement of the Canadian prairie provinces in the late 19th and early 20th centuries was shaped by waves of immigration, including significant numbers of Ukrainians seeking new opportunities. Understanding the early settlement patterns of Ukrainians is a challenging task, particularly because of the inaccuracy of ethnic origin data in historical Canadian Census of Population records. This challenge is due in part to the unique political and historic circumstances of Ukraine during major periods of Canadian immigration. These factors complicate efforts to accurately trace the origins of settlers using traditional sources of demographic data. Archival documents, such as homestead records and township maps, contain more accurate place-of-origin data, but they are harder to access because of inadequate digitization. Homestead records include hand-written information about naturalization and citizenship status as well as the date of arrival on the homestead. This information has been partially transcribed through a community initiative, but the database is incomplete and not machine-readable. Township maps contain handwritten names and geographic locations of settlers, but they have not been widely digitized and are accessible only in provincial archives. This presentation will address the limitations of historical census data in capturing the ethnic origins of early Ukrainian-Canadian settlers and highlight the importance of archival research in reconstructing histories that are obscured by systemic inaccuracies in official records. This work is part of a larger program of research investigating the spatial and social dynamics of Ukrainian-Canadian settlement in Canada.
The Longitudinal Impossible Dataset: Helping Users Navigate the ONS Longitudinal Study
Andreas Mastrosavvas (University College London)
Nicola Shelton (University College London)
The Office for National Statistics Longitudinal Study (ONS-LS) follows a 1% sample of the population of England and Wales through each decennial Census, linking Census data with data from birth, death, and cancer registers. As one of the largest datasets of its kind in the UK, the ONS-LS is used for public good research on topics ranging from public health to labour market outcomes. However, access to the data is highly controlled and only possible via secure settings, meaning that researchers must often identify required variables and develop code prior to seeing the data. With thousands of variables available, navigating and exploring the available metadata can be a complex task. This presentation will showcase the Longitudinal Impossible Dataset (LIDS): an interactive artificial data product intended to familiarise prospective users with the data structures and variable domains represented in the ONS-LS. Areas covered will include conceptualisation, development, deployment, and user feedback, sharing insights for practice in secure data user support services for social science research. It will also offer a brief introduction of related initiatives undertaken at the Centre for Longitudinal Study Information and User Support (CeLSIUS).
Sharing our Charts: The Role of Documentation in Navigating Data for Social Researchers
Will Clary (Institute for Social Research)
Lindsay Gypin (Institute for Social Research)
In a world where data-driven research shapes policy and public understanding, the significance of meticulous documentation and reproducibility cannot be overstated. Our presentation will focus on the documentation process for our recently published dataset, Voter Registration, Turnout, and Partisanship. This dataset contains counts of registered voters, ballots cast, and voting-eligible population by county in the United States. The National Neighborhood Data Archive (NaNDA) at the University of Michigan compiled and cleaned this dataset to serve as a valuable resource for researchers examining the intricate relationships between voter behavior and demographic factors. NaNDA aims to provide geographic information in an accessible format, breaking down barriers that can hinder the integration of spatial data into social science research. We facilitate seamless linkage to other datasets by making data available in familiar tabular formats. Our commitment to clear documentation and reproducibility is reflected in all our work, allowing researchers to build upon existing work rather than recreate it from scratch. This commitment also allows for the continuity of updated datasets as new data becomes available, keeping our work evergreen and relevant. The open-access nature of NaNDA's datasets ensures that researchers can easily download and utilize our data from platforms such as ICPSR, while also making it increasingly accessible through partnerships with institutions like the Michigan Center on the Demography of Aging (MiCDA) and the Michigan Medicine DataDirect portal. In this presentation, we will illustrate how robust documentation and reproducibility not only accelerate the research process but also foster a collaborative environment that advances social science inquiry. NaNDA is paving the way for impactful research that is both transparent and replicable, seeking to exemplify the future of geographic data services in social research.
Building a Bridge between Data Librarian Skills and Bibliometric or Research Output Analysis
Jennifer Chaput (UMass Amherst Libraries - University of Massachusetts Amherst)
Data librarians can apply their range of skills to supporting bibliometrics and research output analysis projects within the library as well as more broadly at their institutions. This talk will discuss several projects and collaborations that the Data Services department at UMass Amherst Libraries has participated in. Projects include a bibliometric analysis on a specific topic, a research output analysis of open access publications, and overviews of other projects where the team provided consultations or participated more minimally. The talk will include discussion of project workflows, ideas on how to initiate or find these collaborations, what related data librarianship skills were used, and where to get training on new skills in these areas. Special attention will be paid to highlighting the openly available tools and data sources used in this work as a way to demonstrate how to do similar projects with limited resources.
Building computational capacity in data service professionals
Louise Capener (UK Data Service)
Acquiring computational skills is becoming increasingly important for data services staff working across the social sciences. This includes the computing skills they need to deliver data services effectively, and the computational social science skills used for research. These skills are needed to keep pace with new forms of data, new methods, ever-greater computing power, the new opportunities for research that are changing at pace as the data environment evolves, and, accordingly, the evolving needs of those using these kinds of data. Data services need to be appropriately equipped in order to serve the research community. However, there are many barriers to acquiring these skills among data services staff, especially as both recently qualified and established staff were likely trained in traditional forms of data and statistics. This interactive presentation describes progress and preliminary results from a UKRI-funded project led by a team at the University of Manchester who are affiliated with the UK Data Service. The project aims to build capacity within (a) UK Data Service data professionals and (b) the wider international data services community, via a variety of upskilling mechanisms. The vision is to enhance data services capacity globally, enhance the careers of data service professionals, and establish a Community of Practice to contribute to lifelong learning. The session will also gather informal and anonymous participant feedback on the results of the project to date.
Exploring the Role of Data Consultation and FAIR Data Principles in Modern Librarianship
Olayemi Oluwasoga (IITA)
Digital data has experienced exponential growth and transformed the landscape of research and scholarship. Scholarly libraries now serve as repositories not just for books and text articles, but also for digital data and other digital information. The role of librarians can no longer be undermined or overlooked: no longer confined to old stereotypes, librarians are seen as active members of the research community. They are becoming increasingly pivotal in ensuring effective data stewardship by adhering to the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. These principles help researchers ensure that their data is discoverable, accessible, and usable by others, and guide the creation of metadata, data curation, and data preservation strategies. Librarians can offer support services through consultations on managing the data life cycle, educating on ethical concerns, and ensuring that researchers meet compliance requirements. When data consultation and advocacy are incorporated into librarianship, librarians can contribute significantly to fostering transparency, innovation, and sustainability in the management of data. This can empower researchers to maximize the impact of their work and contribute more to the advancement of knowledge. This paper looks at the future of integrating data management into modern-day librarianship.
OSF and Infrastructure Enabling Efficacy in Researcher Practices
Nadja Oertelt (Center for Open Science)
Libraries now play a vital role throughout the research lifecycle, guiding scholars from initial project planning and data management to publication and preservation. As research grows more complex, librarians and support professionals must navigate a dizzying array of tools, best practices, and disciplinary conventions. In this landscape, the Open Science Framework (OSF) is a flexible, community-driven platform capable of supporting researchers at any stage of their research project. This presentation will detail how OSF's features can enhance library consultation and support services and strengthen partnerships across campus. From helping researchers organize projects and ensuring data integrity, to curating metadata standards and fostering collaboration, librarians can leverage the OSF to streamline workflows and improve the overall research experience. Through concrete examples, we will show how advocating for OSF adoption not only diversifies a library's service portfolio, but also positions it as an essential catalyst for transparent, reproducible scholarship. Looking ahead, we will explore OSF's trajectory, focusing on how user feedback informs its ongoing development. Attendees will learn how community-driven input, strategic collaborations with other platforms, and integration with external repositories contribute to new OSF features and refinements. By understanding these influences, research support staff can actively participate in guiding OSF's evolution, ensuring it remains aligned with institutional needs and researcher demands. This session offers both practical guidance on what OSF can do today, and an outlook on its potential tomorrow—empowering libraries to help shape the future of scholarly communication.
June 6, 2025: Concurrent Session H. Panel
Building New Bridges to Enhance Data Discovery
Megan Chenoweth (ICPSR-University of Michigan)
Marley Kalt (ICPSR-University of Michigan)
Alison Sweet (ICPSR-University of Michigan)
Kathryn Lavender (ICPSR-University of Michigan)
Chelsea Samples-Steele (ICPSR-University of Michigan)
Sarah Rush (ICPSR-University of Michigan)
ICPSR has over 60 years of experience archiving data and adhering to FAIR data management principles for the data we hold. But true data discoverability calls for building bridges – establishing connections across systems that help make data more FAIR, no matter where it lives. In this panel presentation, ICPSR staff representing various projects within ICPSR will discuss recent endeavors undertaken by ICPSR archives to promote the discoverability of external data resources alongside ICPSR data and make ICPSR data and related resources findable and interoperable within other platforms. This panel will highlight and discuss ICPSR projects including: A combination data-and-metadata archive that fosters transparency in health research; Building a data lake to make social media data interoperable across online platforms; Developing a remote cloud computing infrastructure and making the system replicable so other data custodians can also provide trusted research environments; Using DDI Lifecycle to display longitudinal data series and clarify the potential for research harmonization across different collections; Developing an API for improved sharing of study metadata with external platforms; Establishing partnerships that enable the mirroring of and/or linking to data available through other external catalogs.
Empowering Data Librarian Authors: Creating Transparent Data Sharing Policies and Avenues for Publishing Open Access Data
Regina Raboin (UMass Chan Medical School)
Julie Goldman (Harvard Library)
Allie Tatarian (Tufts University)
Curtis Brundy (Iowa State University)
Library and information science professionals are uniquely positioned to lead by example in research data sharing practices. The Journal of eScience Librarianship (JeSLIB) has become an anchor for data librarians to publish their scholarship and continually adapts to the increasingly data-driven world. In the Fall of 2023, Curtis Brundy reached out to see if JeSLIB would publish an article about a dataset he collected. At that time, we did not publish data articles, but the Editors decided to pursue this opportunity to support scholarship and accessibility of data within the librarian community. For this new article type, JeSLIB added a Data Editor, and by March 2024 JeSLIB had the article format, template, and author instructions in place to accept our first data article. We're excited to share how JeSLIB has enhanced its support for data sharing and data scholarship through a new initiative with three key pieces: creating a Data Editor position, launching a new "Data in Action" article type, and developing a user-friendly template to make data publication accessible. Our new Data Editor, Allie Tatarian, analyzes and reviews data articles, and verifies all JeSLIB articles meet our data sharing policy. They work directly with our authors to ensure the data linked to their publications meets the FAIR Principles (findable, accessible, interoperable and reusable) for data sharing. Our new "Data in Action" article type, paired with a user-friendly template, creates an easier path for librarians to publish high-impact datasets they've created. This panel presentation will feature a discussion between the author and journal editors as they reflect on their collaboration in creating accessible venues and clear processes for sharing data work. We'll share our early experiences and insights that could be helpful whether you're thinking about publishing your own data-focused work or developing similar services at your institution.
The Transformative Role of Artificial Intelligence in Data Science: Enhancing Social Science Information Services
Patience Meninekele Chewachong Akih (Eduvos)
This literature review examines the transformative role of artificial intelligence (AI) in advancing data science practices within social science information services, focusing on developments from 2019 to 2024. The growing integration of AI has transformed how data is collected, managed, and analysed, enabling exceptional effectiveness and insight in social science research and data curation. This paper synthesizes recent scholarly contributions to understand how AI-driven tools, such as natural language processing, machine learning, and automated decision systems, are reshaping information services. It explores critical themes, including the ethical challenges of bias in AI algorithms, the impact on data accessibility, and the enhancement of metadata creation and discovery processes. The review highlights case studies demonstrating the deployment of AI to streamline data workflows, foster interdisciplinary collaboration, and improve the dissemination of complex datasets to diverse user communities. Furthermore, it addresses the implications of these advancements for data stewardship and the role of social science librarians as facilitators of AI-augmented services. Challenges, including privacy concerns and the need for transparency in AI applications, are also discussed to offer a balanced perspective. This paper concludes with a forward-looking analysis of emerging trends, emphasizing the necessity for ongoing professional development and cross-sector partnerships to harness AI's full potential while mitigating risks. By consolidating these insights, this review aims to guide the evolution of social science information services in the era of AI.
Keywords: Artificial Intelligence, Data Science, Social Science Information Services, Machine Learning, Metadata, Ethics in AI
Navigating the AI Act: Implications and exceptions for Research
Hildur Thorarensen (Sikt Norwegian Agency for Shared Services in Education and Research)
Siri Tenden (Sikt Norwegian Agency for Shared Services in Education and Research)
The European Commission's Artificial Intelligence Act (AI Act) promises to be a transformative piece of legislation, shaping the future of AI applications within the European Union. This presentation will delve into the intricacies of the AI Act, focusing on its implications for research and the exceptions provided for this purpose. Research often involves the collection and analysis of large-scale datasets, with increasingly sophisticated AI tools being employed, and the AI Act's introduction raises important questions about the regulation of these AI methodologies in research. The talk will scrutinize the specific exceptions within the AI Act for research purposes, providing clarity on their scope and application. Whether the Act complements or conflicts with existing data protection regulations like the General Data Protection Regulation (GDPR) is a pertinent concern, so the presentation will also examine the interplay between these regulatory frameworks, offering insights into maintaining legal compliance in data-driven research. The aim is to provide attendees with a nuanced understanding of the AI Act and strategies for navigating its implications in research settings.
AI and metadata in the classroom: A work-integrated learning project
Elizabeth Stregger (Mount Allison University)
Stephen J. Geier (Mount Allison University)
Sabrina Sandy (Mount Allison University)
Duc Tri Dang (Mount Allison University)
Students in two undergraduate courses explored textual data and scholarly communication through a work-integrated learning project focused on journal metadata migration. They participated in workshops on manual metadata entry, AI prompt generation, and web scraping and scripting. Taking multiple approaches allowed students with diverse levels of digital literacy skills to critically engage with a real-world problem. In Team-GPT, students experimented with two artificial intelligence models, Claude Sonnet and GPT-4o, to convert HTML from a journal website into XML matching the Open Journal Systems schema. They broke the problem into smaller, manageable tasks, testing and refining reusable AI prompts. Along the way, they asked questions about journal practices and metadata challenges, gaining deeper insights into scholarly publishing. For a final assignment, each student created 18 metadata records using the AI-assisted workflow. These records will be evaluated alongside human-created metadata and outputs from a scripted approach to determine accuracy. This work-integrated learning project had substantial benefits for the journal, the library, and students. The journal editorial board will be able to make an informed decision about the migration to Open Journal Systems. Librarians, also the instructors for these courses, blended professional practice into pedagogy and made advances on a library publishing project. Students made meaningful contributions to the Open Access movement. We all practiced divergent problem-solving and continued to build informed opinions on the benefits and challenges of working with artificial intelligence. In this presentation, librarians and students will share our insights into how AI tools and metadata projects can be integrated into educational contexts. 
We will also discuss how work-integrated learning projects can effectively bridge pedagogy and practice, equipping students with critical skills in digital literacy, problem-solving, and scholarly communication.
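A toy version of the students' conversion task can make the workflow concrete. The HTML structure and the minimal XML below are hypothetical stand-ins; the real journal's markup and the Open Journal Systems schema are considerably more involved:

```python
# Toy HTML-to-XML metadata conversion (assumed, simplified structures).
import re
from xml.etree.ElementTree import Element, SubElement, tostring

# Hypothetical fragment of a journal article page.
html = ('<div class="article"><h2>Metadata at Scale</h2>'
        '<span class="author">J. Doe</span></div>')

# Pull out the fields of interest (a real workflow would use a proper
# HTML parser; regexes suffice for this fixed toy fragment).
title = re.search(r"<h2>(.*?)</h2>", html).group(1)
author = re.search(r'<span class="author">(.*?)</span>', html).group(1)

# Re-emit them as a minimal XML record.
article = Element("article")
SubElement(article, "title").text = title
SubElement(article, "author").text = author
print(tostring(article, encoding="unicode"))
# <article><title>Metadata at Scale</title><author>J. Doe</author></article>
```

Whether such a step is done by hand, by a script like this, or by prompting an AI model is exactly the comparison the students' three-pronged project explores.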
Wolfgang Zenk-Möltgen (GESIS - Leibniz Institute for the Social Sciences)
Within the domain of political science studies, I recently completed a paper evaluating the FAIR criteria (making data findable, accessible, interoperable, and reusable) for some of the most relevant general election studies. This covered eighteen large-scale surveys from western democracies with at least two waves and incorporated a comparison between 2018 and 2024. The assessment of FAIRness for these studies showed much room for improvement, and results remained largely unchanged compared to six years earlier. However, these datasets represent only a small area of empirical political science research, prompting a new effort to evaluate the FAIR criteria for a broader universe. The new study includes datasets used in articles from six well-known and highly recognized political science journals: American Political Science Review (APSR), American Journal of Political Science (AJPS), British Journal of Political Science (BJPS), International Organization (IO), Journal of Politics (JOP), and Political Analysis (PA). It is based on previous work (Key, 2016) with a dataset using the volumes from 2013/2014, and extends this with the volumes of 2022/2023. The FAIR scores for all available studies can be evaluated using dataset persistent identifiers. The analysis will show the state of FAIRness for the whole corpus of data used by these relevant political science studies. In addition, a comparison between recent studies and those from nine years before will be made, and factors for increased FAIRness of studies in the domain of empirical political science will be highlighted.
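To illustrate the kind of check that sits behind a FAIR score derived from PID metadata, here is a hypothetical sketch. The field list, grouping, and record below are invented for illustration and are not the rubric used in the study described above:

```python
# Hypothetical FAIR-completeness check over a dataset's metadata record.
# Field names and their mapping to FAIR principles are illustrative only.
def fair_completeness(record):
    checks = {
        "findable":      ["doi", "title", "keywords"],
        "accessible":    ["url", "access_conditions"],
        "interoperable": ["metadata_standard", "file_format"],
        "reusable":      ["license", "documentation"],
    }
    # Fraction of non-empty fields per principle.
    return {principle: sum(1 for f in fields if record.get(f)) / len(fields)
            for principle, fields in checks.items()}

# An invented metadata record for a fictitious election study dataset.
record = {"doi": "10.1234/example", "title": "Example Election Study",
          "url": "https://example.org/data", "license": "CC-BY-4.0"}
print(fair_completeness(record))
```

Real assessments typically resolve the dataset's persistent identifier to a registry metadata record first and apply a published scoring scheme; the point here is only that FAIRness can be decomposed into checkable, per-principle criteria.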
Lessons learned and the way forward - leveraging PIDs and metadata for FAIR research workflows
Xiaoli Chen (DataCite)
Persistent identifiers (PIDs) and their associated metadata are uniquely positioned to play an instrumental role in capturing, preserving, and providing access to provenance information that is key to facilitating findability, accessibility, interoperability, and reusability (FAIR) of scholarly resources. Over the past few years, we carried out a FAIR Workflows project in close collaboration with a neuroscience research group, in which we conducted an exemplar neuroscience research project entirely following the principles of FAIR research practice - with an openly shared DMP, pre-registered experiments, a published dataset and preprint, and numerous other outputs throughout the project lifecycle, using research tools and platforms integrated with open infrastructure. The project resulted in a comprehensive portfolio within which all people, organizations, and outputs are identified and interconnected through PIDs. In this presentation, I will walk through the work done and share lessons learned in the first three years of the FAIR Workflows project: 1. It is key to ensure the infrastructure and tools are ready to support researchers when they are ready to embrace open practices like preregistration, data sharing, and preprint publishing; 2. Funders and publishers are essential in providing guidelines to support and streamline the integration of open and FAIR sharing practices; and 3. The researcher communities that create and maintain data standards and data analysis tools/platforms to enable sharing and reuse can benefit greatly from the integration of PIDs. I will also touch on ongoing and future work in the project, including the validation of FAIR practices in a collaboration context and the examination of the reuse of these FAIR outputs.
This proposal is intended to contribute to the discourse on "data provenance, CARE / FAIR data principles" at the conference; we hope to provide inspiration and invite feedback and engagement from the community for our ongoing work.
Managing Library-Licensed Data: Exploring In-House Control vs. Third-Party Platforms
Jiebei Luo (New York University)
Alice Kalinowski (Stanford University)
Acquiring datasets for library collections entails complex workflows. Initial tasks include investigating datasets, negotiating licenses, and finalizing acquisitions, followed by organizing data, creating documentation, managing user access, and ensuring compliance with license agreements. These processes demand significant effort and resources, requiring coordination across multiple library departments. This presentation will explore the trade-offs involved in managing library-licensed data across two categories: datasets fully handled by the library and those managed through third-party platforms. Attendees will gain a clear understanding of the benefits and challenges associated with each approach, along with a practical checklist to support informed decision-making when choosing between these management models.
Ceilyn Boyd (The Dataverse Project, Harvard University)
Sonia Barbosa (The Dataverse Project, Harvard University)
Research data is growing exponentially in size and complexity, and funders are increasingly calling upon researchers to share their data. To meet these challenges, the Harvard Dataverse Repository now offers flexible, sustainable solutions for the stewardship and archiving of large-scale research data. We combine the comprehensive data stewardship features of Harvard Dataverse with storage solutions from Amazon Web Services (AWS) and the Northeast Storage Exchange (NESE), and file transfer protocols like S3 and Globus, to provide cost-effective options that meet today's large data-sharing challenges. Large data project outputs are diverse. They span the gamut from datasets with many small files, to a few extremely large files, to collections that start small but grow quickly over time. Furthermore, data re-users expect access to large datasets on demand, asynchronously, via graphical user interfaces (GUIs), or via scripting using application programming interfaces (APIs). Our service offerings, from data management planning to archiving, address diverse use cases. We offer pre-deposit consultations, quotes, and letters of support; curation assistance arranging and describing datasets for usability; and long-term storage, access, and retention options for researchers' large research data collections. Harvard Dataverse Repository's Large Data Services empower the research community by ensuring reliable long-term access to large datasets, supporting compliance with funder and institutional data-sharing requirements, and providing cost-effective alternatives to commercial storage solutions. We ensure that data remains accessible, preserved, and valuable for years to come. We look forward to engaging with the community at this conference to share our approach, learn from others, and explore potential collaborations.
Our annual Conference is a serial expression of the work of this Association. To evoke library-land ontology still further, our presentations, the articles in the IASSIST Quarterly, and the IASSIST website are the collective manifestation of its history. I was not there at the start of IASSIST, but it was in the 1980s that I was charged with developing a data library service. Reading ten years' content of the IQ, I gleaned two key challenges: standards for cataloguing datasets and the drive to have data files regarded as first-class objects. That exercise informed a paper commissioned by the Committee of Librarians and Statisticians of the Library Association and the Royal Statistical Society. Fast forward to the changed technology of the 1990s: IASSIST Conference themes included 'numbers, pictures, words and sounds: priorities for the 1990s', 'stewardship of an expanding resource', 'data, networks, and cooperation: linking resources in a distributed world' and 'openness, diversity and standards' - the latter for IASSIST 1993, hosted by Edinburgh University Data Library. Another ten years on came the opportunity to provide a retrospective at IASSIST 2003 on changing technology, reporting progress for Edinburgh's Data Library and its younger sister EDINA. Much of this is summarised in a contribution to the special issue of the IQ dedicated to Sue Dodd, "a Pioneer Data Librarian". Then came the Web. But more than technology, there came digital curation, the admixture of digital preservation and data curation, followed by open access, institutional repositories and research data management - data now very much a first-class object. Of course, nostalgia is not what it used to be. Nevertheless, in celebrating fifty IASSIST Conferences, this paper seeks to leverage the past, intent on helping IASSISTers make even better history over the next 10, 20 or even 50 years.
The Geospatial, Map, and Data Centre (GMDC) at Toronto Metropolitan University Libraries offers essential support to researchers by providing access to geospatial and numeric data resources. The GMDC team includes a GIS and Map Librarian, a GIS Specialist, a Data Librarian, and a recently added Statistical Support Specialist. This session will explore the development of the Statistical Support Specialist role and its impact on expanding the GMDC's services, especially in connecting with a wide range of quantitative researchers across campus. Additionally, the specialist's GIS expertise has enhanced collaboration within the team to offer support in spatial statistics. Their experiences providing in-class and virtual instruction, along with targeted outreach to graduate students and faculty interested in applying geospatial and statistical analysis in their research, will be highlighted in this presentation. Their collaboration is also helping to bridge the gap between curriculum-based learning and the development of individual research skills by fostering communication and knowledge mobilization about quantitative thinking among researchers overall.
Leveraging Graduate Student Consultants to Provide Data Science Consulting in Higher Education
Mara Blake (North Carolina State University)
As domains and disciplines face an ever-increasing need for data science proficiency, universities and colleges must find ways to support their local needs and also prepare students for a multitude of paths. The Alfred P. Sloan Foundation project "Dissemination of Knowledge about Models in Data Science Consulting in Higher Education" brought together a group of representatives from three institutions to develop and share ways that a variety of types of data science consulting programs can leverage graduate student employees. This presentation will share the outcomes of the project, aiming to benefit those looking to start data science consulting programs that employ graduate students as consultants or to enhance existing programs. The presentation will cover practical topics such as: landscape analysis; scoping and administrative structures; hiring, training, and onboarding; and assessment and reporting. The presentation will also present the Data Science Consulting Program at North Carolina State University, a consulting program offered as a partnership between the University Libraries and the Data Science and AI Academy, as a case study. The program employs an interdisciplinary cohort of graduate student data science consultants who offer consulting to members of the campus community. The student consultants bring sophisticated data science skills and disciplinary expertise to the program and allow the library to support the needs of campus users. In turn, librarians and staff offer extensive training to student consultants that develops their communication, teaching, and interdisciplinary collaboration skills. The student data science consultants leverage their technical expertise and communication training to provide support on a wide range of data science and statistical topics for disciplines across campus.
High Tech? Low Tech? - The Goldilocks Zone in Data Collection & Analysis Consultation for Courses
Michael Beckstrand (University of Minnesota)
In the rapidly evolving landscape of data tools, apps, and software packages, finding the balance between cutting-edge technology and accessibility is essential to support research and teaching in the social and behavioral sciences. This presentation explores how consultations with lab groups, undergraduate and graduate courses attempt to bridge the gap between past and future approaches to data services, identifying the "Goldilocks Zone" of tools—sophisticated enough to meet research demands yet approachable for students and faculty with diverse technical skills. The presentation will highlight examples from the past year, including supporting research teams conducting large-scale qualitative coding studies with high-tech platforms for transcription, coding, and analysis, and instances where low-tech spreadsheet workflows proved both effective and efficient. Additionally, it will discuss strategies for guiding faculty and students in designing user-friendly data entry forms and equipping students with skills to perform basic descriptive statistical analyses in spreadsheet tools. By bridging the divide between high-tech and low-tech solutions, we foster connections between established methods and innovative practices. These efforts not only empower research teams today but also anchor a future where data services remain inclusive, adaptable, and impactful.
June 6, 2025: Concurrent Session I. Panel
Data Literacy Education in the Era of AI
Ximin Mi (Federal Reserve Bank Atlanta)
Justin De La Cruz (New York University)
Michael Flierl (The Ohio State University)
Use of data for business in many industries, including but not limited to education, research, health, and finance, faces new opportunities and challenges in the era of Artificial Intelligence (AI). AI literacy demands data literacy. This elicits the essential questions: what does data literacy look like in the AI era? How should I become fluent with data as a researcher, educator, healthcare provider, government employee, citizen, etc., to navigate this new data and information environment? This panel will explore the facets of AI opportunities, challenges, and ethical considerations in developing data literacy in the AI age. Our panelists will share insights from their work across academia, industry, and healthcare to address topics such as: understanding AI training data and its implications, detecting potential biases in datasets, and evaluating AI system outputs. We will discuss practical strategies for building data literacy skills at both individual and institutional levels. The session will conclude with an opportunity for audience members to ask questions about data literacy capabilities in an AI-driven world. Guiding questions:
1. Opening thoughts: What data literacy efforts exist in your organization, and what is your role? What are the AI products/services in your organization, and how are they used?
2. Ethical considerations for data and AI: What keeps you up at night? What are not enough professionals aware of in terms of ethics, data, and AI?
3. Opportunities for data literacy: How can we engage with data quality when approaching data literacy education? What are the most exciting opportunities for cultivating data literacy in this AI age?
4. Practical challenges and considerations: How can we optimize the use of organizational AI resources to develop learning materials or perform administrative tasks? How can we benchmark or assess the efficacy of AI models interacting with data?
June 6, 2025: Lightning Talks
Building Interdisciplinary Data Curation Partnerships
Talya Cooper (New York University)
NYU Data Services received dedicated funding for the 2024-2025 academic year to curate urban data - broadly defined as data about cities and their populations - a theme that cuts across many different academic disciplines. This lightning talk will present the results of this opportunity, which allowed us to pilot a student data curator position. The student worker has the specific responsibilities of liaising between our department and an urban policy institute at NYU; conducting a university-wide landscape analysis to locate creators of urban data; and creating subject-specific documentation guidelines. The talk will discuss the results of this project to date, including the methodology for the landscape analysis and observations about training a student curator. It will also detail lessons learned and plans to build on this project in the future.
Evolving Literacy Landscapes: Developing a Toolkit for Artificial Intelligence Education
Brenna Bierman (Vanderbilt University)
John Paul Martinez (Vanderbilt University)
Sheldon Salo (Vanderbilt University)
With the increased adoption of generative artificial intelligence (AI) tools by the general public, there is a clear need for information professionals to adapt current information and data literacy frameworks to encompass the unique considerations presented by AI. Information literacy education has had to evolve numerous times as technology alters how professionals teach the general public about accessing information, data, and now, AI-generated content. As we move towards a future where AI-generated content is increasingly available, we must build upon existing data and information literacy frameworks in order to ensure that professionals, enthusiasts, and greater communities can navigate and critically analyze information presented to them, regardless of how it was created. This lightning talk will consist of a review of existing literature and frameworks from multiple fields and an overview of the authors' creation of an AI literacy toolkit which data professionals, librarians, and academics can incorporate into information and data literacy sessions. Without an understanding of how AI generates responses, the perceived authority of large language models (LLMs) such as ChatGPT could accelerate the spread of misinformation. The authors' toolkit on teaching about generative AI will enable learners to critically evaluate the responses, sources, and potential biases inherent within AI tools. This toolkit will provide data and information professionals with a pre-made set of materials, including infographics, mini-lesson plans, and interactive activities, that they can use to interface with the general public regarding AI literacy. Through providing this toolkit, the authors aim to bridge the knowledge gap between information professionals and non-practitioners and increase understanding of AI and LLMs, which should facilitate better decision making regarding future implementations of AI tools.
Leveraging LGBTQ+ Data Sources for Global Advocacy and Research
Kevin Manuel (Toronto Metropolitan University)
Meryl Brodsky (University of Texas at Austin)
Blake Robinson (Rollins College)
Van Bich Tran (Temple University)
As the world progresses toward greater recognition and support for LGBTQ+ rights, the availability and analysis of comprehensive data becomes increasingly essential. In response to these needs, the IASSIST Diversity, Equity and Inclusion's LGBTQ+ Data Subgroup has developed an online guide that highlights data resources about LGBTQ+ populations that are often underreported or inadequately understood due to gaps in data collection and analysis or lack of official recognition at the national level. To address this, we formed an international Subgroup that identified a variety of international LGBTQ+ data sources, ranging from government surveys and nonprofit research to grassroots community-driven data collection efforts. These sources provide invaluable insights into the health, economic, social, and legal experiences of LGBTQ+ individuals, which can serve to inform the development of more inclusive policies and interventions. By building and sharing this collection of LGBTQ+ data, this guide aims to drive more effective advocacy and research in support of global LGBTQ+ equality.
Empowering Indigenous Communities in Data Governance Through Local Contexts
Sarvenaz Ghafourian (Ocean Networks Canada)
Chantel Ridsdale (Ocean Networks Canada)
Steph Golob (Ocean Networks Canada)
Ocean Networks Canada (ONC) recognizes the critical role of Indigenous communities in shaping data governance practices that reflect their unique cultural, ethical, and environmental priorities. To support Indigenous data sovereignty, ONC is integrating Local Contexts labels functionality into its metadata profiles and dataset infrastructure. Local Contexts enables Indigenous communities to assert control over how their data is collected, accessed, and used while hosted on ONC's data repository, fostering transparency and ensuring that cultural significance is preserved. By embedding Local Contexts information, communities can apply labels for biocultural heritage, traditional knowledge, or other culturally specific information directly to datasets. This empowers them to restrict access to sensitive data and communicate usage guidelines to users in a manner aligned with their community's values. Communities retain the flexibility to update these labels as their governance needs evolve, ensuring that their authority remains central to the data lifecycle. ONC is implementing this functionality using DataCite and ISO 19115 to ensure metadata is both machine- and human-readable. ONC's pilot project will demonstrate Local Contexts labels integration through designing a "mock community" with ONC-owned data, providing Indigenous partners with tangible examples of how these tools can enhance their data governance and sovereignty. This initiative is designed to align with the CARE Principles (Collective Benefit, Authority to Control, Responsibility, Ethics) by prioritizing the individual needs and perspectives of Indigenous communities. Ultimately, the integration of Local Contexts labels into ONC's data repository offers a scalable model to further promote Indigenous data sovereignty and governance. This ensures that communities remain the primary stewards of their cultural and environmental knowledge.
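As a rough illustration of how a Local Contexts label might travel with a dataset's metadata record (a sketch only, not ONC's actual implementation - the label choice, project URL, and DOI below are hypothetical placeholders), a DataCite-style record can carry the label as a rights entry whose URI points at the label definition:

```python
import xml.etree.ElementTree as ET

# Hypothetical Local Contexts label; the project page and DOI are
# illustrative placeholders, not real ONC or community identifiers.
TK_LABEL = {
    "name": "TK Attribution (TK A)",
    "uri": "https://localcontexts.org/label/tk-attribution/",
    "project": "https://localcontextshub.org/projects/example-project/",
}

def add_local_contexts_rights(resource: ET.Element, label: dict) -> ET.Element:
    """Append a Local Contexts label to a DataCite-style <rightsList>."""
    rights_list = resource.find("rightsList")
    if rights_list is None:
        rights_list = ET.SubElement(resource, "rightsList")
    rights = ET.SubElement(rights_list, "rights", rightsURI=label["uri"])
    rights.text = f'{label["name"]} (see {label["project"]})'
    return rights

# Build a minimal record and attach the label.
resource = ET.Element("resource")
ET.SubElement(resource, "identifier", identifierType="DOI").text = "10.xxxx/example"
add_local_contexts_rights(resource, TK_LABEL)
print(ET.tostring(resource, encoding="unicode"))
```

Because the label is stored by reference (a URI) rather than copied into the record, a community updating its label text in the Local Contexts Hub would not require every downstream metadata record to be rewritten; a comparable mapping might be made into the ISO 19115 resource-constraints elements.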
Anchoring Data Services in a Law School Library: A Case Study of Early Initiatives
Alisa Lazear (NYU School of Law Library)
As law libraries increasingly engage with data-driven scholarship, launching data services requires navigating uncharted waters of research culture, stakeholder needs, and institutional priorities. This lightning talk will share how the NYU Law Library is charting its course in developing data services, focusing on early-stage efforts to understand research needs, define the scope of services, and address challenges. The talk will also highlight strategies for information gathering, such as surveys and outreach, to better understand the data-related priorities of faculty and students. A critical part of this process has been liaising with the central NYU Libraries' data services team and collaborating with other library departments to leverage existing expertise and resources. Additionally, engaging with diverse stakeholders, including technology departments and administrative teams, has been key to building partnerships. We will discuss opportunities identified during this process, such as supporting data literacy education and facilitating early data management best practices, alongside challenges like resource constraints and varying levels of data literacy. Attendees will gain practical insights from NYU Law Library's experiences in laying the groundwork for data services, including tips for defining the role of data services in their own institutions, starting small, and building momentum for future initiatives. This lightning talk offers a roadmap for law libraries navigating the complexities of establishing themselves as leaders in the evolving world of data-driven legal scholarship.
From Chaos to Clarity: The Transformative Impact of Metadata on Research
Meret Hildebrandt (FORS)
In a world overflowing with data, metadata operates behind the scenes, quietly orchestrating how we organize, analyse, and understand information. But what exactly is metadata, and why does it matter so much for research? This lightning talk shines a spotlight on the "data about data" and its vital role in making the digital data universe functional. Metadata provides the context, structure, and meaning that raw data alone cannot. It is what turns an overwhelming flood of information into something searchable, usable, and actionable, allowing us to develop sound scientific evidence. Metadata helps us navigate, organize, and make sense of the digital chaos. This talk will explore how metadata impacts our daily lives, with real-world examples from social science research areas such as public opinion analysis, where analysed data such as social media posts must be judged on metadata like context, geolocation, timestamp, or the number of times a post has been shared. Other examples will come from migration studies and educational outcomes tracking. It will also touch on the challenges and opportunities metadata presents, from privacy concerns to its transformative role in artificial intelligence and predictive analytics - for example, its use in the health sector to develop algorithms that detect a tendency for self-harm or depression in speech patterns. In just a few minutes, you'll discover why metadata is the engine driving future research and how AI plays a crucial role. Join us to uncover how metadata shapes the way we interact with the world and why understanding it is key to unlocking the full potential of current data-driven social science research.
Supporting research and teaching in social sciences with computational notebooks
James Reid (EDINA, The University of Edinburgh)
EDINA, an innovation unit within the Information Services Group of The University of Edinburgh, provides online services to the UK HE sector. Its flagship Digimap service is used by over 120 universities the length and breadth of the UK. This lightning talk will introduce EDINA's newest service, Noteable, providing exemplars from real-life uses that show how computational notebooks can offer a low-barrier entry point to learning coding skills and ensure reproducibility in data analysis for the social sciences. Also shared will be the innovative use of APIs from within notebooks to access Large Language Models, enabling chat-like interactions with social science datasets.
Building a Data Management Network
Alicia Hofelich Mohr (University of Minnesota)
Shannon Farrell (University of Minnesota)
As requirements for data management and sharing continue to grow from US funders, so does the need to scale and coordinate support services on University campuses. This lightning talk will describe our University's growth from building our data management "village" across various campus offices to the development of a new, more specific network - one of Data Managers themselves. We will describe how we are finding data managers who are embedded in labs, departments, and centers to join our network, as well as the outputs of the network, including a "Day of Data Clean Up", various meet and greets, and crowdsourced guides. We will also discuss future goals of the network and other benefits of having campus data managers a quick message away.
Beginning at the end with open scholarship curricula
Crystal Steltenpohl (Center for Open Science)
The growing emphasis on research transparency, data sharing, and reproducibility demands that researchers adopt open science practices. However, barriers such as limited awareness, training, and resources continue to hinder adoption even as policies continue to be announced and implemented. The Center for Open Science has developed the "Introducing and Practicing Open and Reproducible Scholarship" program to address these challenges by providing free, adaptable, and practical materials designed to equip researchers and educators with the tools and knowledge necessary to navigate this evolving ecosystem. To ensure that the program met critical researcher, funder, and institutional needs, COS staff utilized "beginning at the end" exercises to identify the primary skills that researchers need to learn in order to comply with emerging data policies, and how they are most likely to apply them. We developed new modules designed to organically connect these policies to the rigorous research practices needed to comply with them, and to applications of those practices within our research lifecycle support tool, the Open Science Framework (OSF). This talk will describe our curriculum design process, our findings in relation to prioritized skills, and the measures we took to keep audience needs as the primary reason and focus of each module. We will also share our efforts to enable advocates and experts to utilize the modules within their own communities through careful pedagogy considerations, documentation, and resource availability. Finally, we will describe our ongoing efforts to evolve and improve our resources and partnerships across scholarly communication communities.
Exploring Natural Language Search for Data Retrieval using Large Language Models (LLMs)
Harold Kroeze (Statistics Netherlands)
Natural language search is increasingly important in data retrieval, as it enables users to search for data using everyday language. In an experiment, we are leveraging Large Language Models (LLMs) to enable users to search for data in natural language within Statistics Netherlands' Data Service Centre (DSC). We are utilizing a combination of sources, including dataset and variable descriptions from the DSC, tips and tricks scraped from the intranet, and concepts from the Statistics Netherlands RDF store. A subset of DSC data is formed by data from the System of Social statistical Datasets (SSD): a comprehensive repository of microdata covering various aspects of people's lives, such as health, education, work, relationships, crime, and social benefits. Our goal is to develop an LLM-based search functionality that enables researchers to retrieve relevant variables from the SSD in a more intuitive and efficient manner. Our approach, agentic Retrieval Augmented Generation (RAG), involves embedding textual information from these sources.
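To make the retrieval step of such a RAG pipeline concrete, here is a deliberately simplified sketch: a toy bag-of-words "embedding" with cosine similarity stands in for a real LLM embedding model, and the variable names and descriptions are invented rather than taken from the actual SSD.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real RAG pipeline would call an
    LLM embedding model here and store the vectors in a vector index."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical SSD-style variable descriptions (illustrative only).
variables = {
    "INCOME2023": "net yearly income of a person from work and benefits",
    "EDULEVEL": "highest completed education level of a person",
    "CRIMSUSP": "person registered as a crime suspect in the reporting year",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k variable names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(variables,
                    key=lambda v: cosine(q, embed(variables[v])),
                    reverse=True)
    return ranked[:k]

print(retrieve("how much income does a person earn from work"))
```

In the agentic variant described in the abstract, the retrieved descriptions would then be placed into the LLM's context so the model can answer or refine the researcher's question, possibly issuing further retrievals on its own.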
Katja Moilanen (Finnish Social Science Data Archive)
The Finnish Social Science Data Archive (FSD) is enhancing its study-level metadata production to ensure a smoother information flow. This improvement is based on two revamped systems: the data depositing tool in the Aila data portal (currently being implemented) and the operational data management system TIIPII3 (already in production). The Aila and TIIPII3 metadata tools are implemented using, among other technologies, Python 3, lxml, RESTful APIs, and AMQP messaging with RabbitMQ. The data depositor will provide most of the contextual study-level metadata using the data depositing tool. Due to GDPR regulations, the data depositing tool may not be used for long-term preservation. When the data deposits are removed from the depositing tool, the study-level metadata will be exported as a DDI 2.5 XML file in Finnish. This exported file will then be enriched with additional information (such as series information and related publications), first manually by the metadata curator and then using TIIPII3, and saved as an XML file in the TIIPII3 file system. For our international customers and collaborators, we also produce study-level metadata in English. TIIPII3 provides automatic translations for many metadata fields, but some require manual translation. The English metadata is also saved in the TIIPII3 file system as a DDI 2.5 XML file. DDI 2.5 XML is our long-term preservation format for metadata. When study-level information changes in TIIPII3, the XML files in both languages are automatically updated accordingly.
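As a small, self-contained illustration of the kind of automated update described above (a sketch using the standard library's ElementTree rather than lxml, with a simplified DDI 2.5-style record - the study and series names are invented, not FSD holdings):

```python
import xml.etree.ElementTree as ET

DDI_NS = "ddi:codebook:2_5"  # DDI-Codebook 2.5 default namespace
ET.register_namespace("", DDI_NS)

# Minimal DDI 2.5-style record (element names per DDI-Codebook, simplified).
xml = f"""<codeBook xmlns="{DDI_NS}">
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Survey 2024</titl></titlStmt>
      <serStmt><serName>Example Series</serName></serStmt>
    </citation>
  </stdyDscr>
</codeBook>"""

def update_series(codebook: ET.Element, series_name: str) -> None:
    """Sketch of the enrichment step: when series information changes in
    the curation system, rewrite it in the preserved XML file."""
    ser = codebook.find(f".//{{{DDI_NS}}}serName")
    ser.text = series_name

root = ET.fromstring(xml)
update_series(root, "Example Survey Series")
print(ET.tostring(root, encoding="unicode"))
```

In production such a function would run once per language version (Finnish and English), writing the updated DDI 2.5 file back to the TIIPII3 file system whenever a change message arrives over AMQP.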
The idea of developing an ISO standard for DDI has been discussed for years. Several factors stymied the efforts: 1) a lack of resources, particularly people with the experience to write and shepherd the project forward; 2) the decision about which aspects of DDI products to standardize under ISO without creating a lot of duplicative work; and 3) the identification of an ISO TC (technical committee) to take on the work. The Scientific Board under the DDI Alliance created a temporary Working Group to explore the feasibility of creating the ISO standard, and this tWG was able to address the issues raised above. The work is underway, and a draft of the standard is written. The next steps are to gain the approval of the DDI Alliance to move forward and submit the proposal to ISO/TC46/SC4. This poster describes the structure of the draft for standardization, the state of the approval process under the Alliance, and the process under ISO/TC46/SC4 for moving the document forward.
Data discovery made easy: enhancing access to the Great Britain Historical GIS via a Large Language Model
Humphrey Southall (University of Portsmouth)
Xan Morice-Atkinson (University of Portsmouth)
Paula Aucott (University of Portsmouth)
The "Data Discovery Made Easy" project (DDME) is funded by the UK Economic and Social Research Council as part of their "Future Data Services (Pilots)" programme. It is adding a prototype natural language search interface to the existing web site A Vision of Britain through Time. This is a public interface to the Great Britain Historical GIS (GBHGIS), a large Postgres/PostGIS database holding data from every British census 1801-2021, diverse other statistics including vital registrations and the farming census, and digital boundaries for most of the ever-changing reporting geographies. We argue that existing data services have become too focused on the needs of data scientists, who invest substantial time in learning to navigate download systems. We focus instead on mainstream social scientists and others like journalists and policy analysts, often seeking just one local time series or even a single data value. The GBHGIS holds all statistics in a single central data store, but the diversity of content and the enormous complexity of Britain's statistical geographies make data discovery challenging. Our poster will provide an overview of the DDME project, including:
- Our survey of user needs, focusing on social scientists whose main concerns are with their own surveys, or theoretical, but who access secondary data to provide context.
- An external review of our data model, and its compatibility with current data standards including DDI and SDMX.
- Our metadata editors, enabling our unique data structure to be more easily extended without an intimate knowledge of the model.
- The new natural language search interface, acting as a bridge between the non-specialist user and the data repository. It currently depends on Large Language Models (LLMs) from OpenAI but is designed so that these can be replaced by a future locally-hosted LLM, reducing costs.
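One common pattern for this kind of LLM bridge - shown here as a purely illustrative sketch, not the DDME implementation; the table names are invented and the actual model call is left out - is to prompt the model to produce a structured query and then validate that query against the database's known vocabulary before running it:

```python
import re

# Invented table names standing in for GBHGIS reporting tables.
KNOWN_TABLES = {"census_population", "vital_registration", "farming_census"}

def build_prompt(question: str) -> str:
    """Compose the instruction an LLM would receive; in a live system this
    string would be sent to a hosted or local model."""
    return (
        "Translate the question into a single SELECT statement over these "
        f"tables only: {', '.join(sorted(KNOWN_TABLES))}.\n"
        f"Question: {question}\nSQL:"
    )

def is_safe(sql: str) -> bool:
    """Accept only read-only queries that touch known tables, so a
    hallucinated or malicious statement never reaches the database."""
    if not sql.strip().lower().startswith("select"):
        return False
    tables = set(re.findall(r"\bfrom\s+(\w+)", sql, flags=re.IGNORECASE))
    return bool(tables) and tables <= KNOWN_TABLES

print(is_safe("SELECT year, value FROM census_population WHERE place = 'Leeds'"))
print(is_safe("DROP TABLE census_population"))
```

Keeping the validation layer independent of the model is what makes the stated design goal feasible: the OpenAI model can later be swapped for a locally hosted LLM without changing the safety checks.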
IASSIST at 50: Where Have We Been and Where Are We Going
Christine Nieman Hislop (University of Maryland, Baltimore)
Meryl Brodsky (University of Texas, Austin)
Michele Hayslett (University of North Carolina at Chapel Hill)
Wei Yin (Columbia University)
Thomas Lindsay (University of Minnesota)
Margaret Adams (Retired)
Cindy Severt (Retired)
IASSIST 2025 marks the 50th conference for the International Association for Social Science Information Service & Technology (IASSIST). This poster will map the locations of previous conferences and highlight locations and regions that have served as repeated hosts for past IASSIST conferences. As we explore and celebrate 50 years of IASSIST conferences, this poster will also demonstrate the geographic gaps and future opportunities for expanding global participation in serving as a conference host and diversifying attendance. One of the hallmarks of IASSIST has been its international perspective. This poster aims to showcase and reflect that trait.
In early 2024, Stanford Graduate School of Business Library acquired a bespoke collection of datasets from MSCI – a financial research and data organization renowned for their proprietary global indexes, ESG (environmental, social, and governance), and climate change data. As we were onboarding the ESG dataset into our collection, we encountered multiple obstacles in the process due to its complex structure and size. Through a close collaboration between librarians and a data scientist, we were able to assess problems in real-time, brainstorm scalable solutions, format and create a collection of comprehensive documentation, and ultimately introduce the new content to our constituents. This poster will delve into the collaborative and iterative process of ingesting the ESG data set specifically, describing and grouping the 28 individual data tables (including aggregating existing content from disparate locations and creating new supporting documentation), and launching the new dataset to Stanford University researchers.
Bridging the Primary to Secondary Data Analysis for Open Science: Building the SSJDA Panel and Its Research Data Management
Kenji Ishida (University of Tokyo)
Sho Fujihara (University of Tokyo)
Sae Taniguchi (University of Tokyo)
Inspired by other projects such as the GESIS Panel, the Center for Social Research and Data Archives (CSRDA) at the Institute of Social Science of the University of Tokyo launched a longitudinal survey, the SSJDA Panel, based on a nationwide probability sample, in 2021. As part of advancing the open science environment in the social sciences, the SSJDA Panel has conducted surveys twice a year (October and February) and solicited question proposals from across the world. While secondary data analysis has been pervasive in the social sciences in Japan since the SSJDA started its service in the late 1990s, there have been fewer chances for early-career researchers to design and conduct probability-based sample surveys. That is our primary motivation for constructing an international collaborative research platform that is particularly suitable for young scholars. Furthermore, we have disseminated the completed datasets as soon as possible through the SSJDA to promote secondary analysis as well. In this presentation, we provide an overview of the SSJDA Panel and its research data management across the data life cycle. We also share and discuss current issues to be addressed and prospects for the sustainability of the panel.
Program for Strengthening Data Infrastructure for the Humanities and Social Sciences
Nobutada Yokouchi (Institute for Social Sciences, The University of Tokyo)
Sae Taniguchi (Institute for Social Sciences, The University of Tokyo)
Masayuki Shioya (Institute for Social Sciences, The University of Tokyo)
Sayaka Terazawa (Institute for Social Sciences, The University of Tokyo)
Satoshi Miwa (Institute for Social Sciences, The University of Tokyo)
This poster session presents the "Program for Strengthening Data Infrastructure for the Humanities and Social Sciences," led by the Historiographical Institute and the Institute of Social Science at the University of Tokyo, commissioned by the Japan Society for the Promotion of Science (JSPS) in 2023. The program focuses on enhancing the Japan Data Catalog for the Humanities and Social Sciences (JDCat), a data platform facilitating interdisciplinary data sharing and utilization. Accessible in Japanese and English, JDCat consolidates metadata from various humanities and social sciences fields into a single searchable platform (https://jdcat.jsps.go.jp). Initially developed during the preceding JSPS program, "Program for Constructing Data Infrastructure for the Humanities and Social Sciences," JDCat has been operational since 2021, utilizing a metadata schema based on DDI Codebook, compatible with other schemas like JPCOAR. The current program aims to scale up JDCat by diversifying the types of metadata it collects. For instance, plans are underway to incorporate metadata for wooden tablets, necessitating revisions to the existing JDCat metadata schema and controlled vocabulary. These updates, while crucial for expanding metadata capabilities, may conflict with FAIR (Findable, Accessible, Interoperable, and Reusable) principles, presenting a significant challenge. This session highlights strategies to address these challenges, including schema enhancements and controlled vocabulary adjustments, ensuring alignment with international metadata standards. By sharing these experiences, we invite feedback from participants and explore best practices for overcoming similar obstacles. Additionally, we seek to identify opportunities for collaboration with institutions interested in advancing metadata frameworks for the humanities and social sciences. 
The poster offers insights into the program's progress and underscores the importance of robust data infrastructure for fostering interdisciplinary and international research collaboration.
Skills Development for Managing Longitudinal Data for Sharing: A Showcase of Training Resources Created for the Longitudinal Population Studies Data Managers Community
Liz Smy (UK Data Archive, UK Data Service)
Hina Zahid (UK Data Archive, UK Data Service)
Cristina Magder (UK Data Archive, UK Data Service)
Gail Howell (UK Data Archive, UK Data Service)
The project Skills Development for Managing Longitudinal Data for Sharing aimed to improve skills and practices for sharing and managing Longitudinal Population Studies (LPS) data across the social, economic, and biomedical sciences. The initiative, funded by the Economic and Social Research Council and the Medical Research Council as part of the wider PRUK work in the UK, built on a prospectus published by Health Data Research UK in 2021, which highlighted the need to maximise the use of LPS data. Our poster will showcase the interactive training workshops we designed and delivered to over 300 data managers and associated data professionals. We will highlight the cross-collaborative approach that underpinned the project to inform the content and delivery of the training. Additionally, we will emphasise the role of outstanding facilitation in creating engaging, practical workshops and the importance of following up with participants to reinforce learning and provide continued support. The tiered training approach we employed ranged from foundational skills to advanced topics such as the creation and dissemination of synthetic data and the use of semi-automated tools for harmonisation. We will also discuss how continuous feedback and evaluation were collected to ensure the training met the evolving needs of the LPS data managers community. We will also present our freely available, openly licensed training materials developed as part of this project, which can be used, shared, and adapted by the LPS community and broader data professionals.
Long Covid Data Dive Club: Using data to generate awareness about Long Covid
Jonas Recker (GESIS - Leibniz Institute for the Social Sciences)
Janet Gunter (LongCovidSOS)
Created as an open grassroots initiative of people with Long Covid or ME-CFS in 2024, the Long Covid Data Dive Club (https://www.longcoviddatadive.org/) uses data to raise awareness about Long Covid and issues surrounding it. Its first two projects are: Identifying funding to date for Long Covid in the UK and Germany, and mapping Covid "vaccination deserts" in the UK, areas where Covid vaccines are not available. The Dive Club hopes to engage with research funders and healthcare providers to make the case for more research funding and improve access to care for people with Long Covid. This poster will present the Long Covid Data Dive Club initiative and its data collection and visualization projects.
Discover the FORS replication service
Emilie Morgan de Paula (FORS - Swiss Centre of Expertise in the Social Sciences)
Meret Hildebrandt (FORS - Swiss Centre of Expertise in the Social Sciences)
Pedro Araujo (FORS - Swiss Centre of Expertise in the Social Sciences)
Sharing data and related materials, such as analysis code produced in the context of research projects, is increasingly encouraged and has become a requirement for many research funders and academic journals. This is expected to contribute to the transparency, replicability, and reproducibility of empirical social science research. Various scientific disciplines are gradually coming to the realization that it is essential to become more transparent through replicable and reusable research processes and results. As more and more academic journals implement open data policies, researchers face the question of where, when, and how to share data as well as other research-related materials. As journals and funders demand the sharing of data used in publications, the way social science research is conducted and disseminated is undergoing a progressive change, and the available infrastructure is being adapted. A crucial element of this change is the transparency of research: making research data, analysis code, and study materials freely available. In short, replication is a core principle of scientific progress and is part of the evidence-making process. To improve reproducibility processes, FORS has developed a technical solution and guidelines for the dissemination of replication material in the social sciences and related disciplines (e.g. economics, psychology). The FORS replication service is dedicated to replication material (which is assigned a DOI) and allows publishers, reviewers, or any interested person to access the data, code, and any other files or information needed to enable replication.
CDSP's "Banque de données": a French pioneer in long-tail research data curation for twenty years
Alina Danciu (Sciences Po)
Lucie Marie (Sciences Po)
Amélie Vairelles (Sciences Po)
The Center for Socio-Political Data (CDSP) is a French pioneer in quantitative long-tail data curation: its "Banque de données" has been sharing DDI-curated research data for twenty years. The most downloaded data concern political attitudes and behaviour, gender, family, immigration, school, health, cultural practices, and new technologies. CDSP's DataBank was also the first source of French election results, covering elections from the 1960s until 2012. The CDSP runs the first French probabilistic panel, ELIPSS, whose data are also shared on the DataBank. In August 2023, the CDSP's DataBank became the first French quantitative SSH repository to obtain the CoreTrustSeal. The CDSP reaffirms its commitment to trustworthy and transparent preservation processes for its French and international research community. This poster highlights the CDSP's DataBank, focusing on its data deposit, curation, and preservation processes. Special attention is given to our alignment with international standards such as DDI and with the FAIR principles, and to our strategy for ensuring trustworthy and accessible data. We also showcase the services and workflows designed to support researchers in depositing their datasets, with a focus on our data acceptance criteria.
Using REDCap to enhance longitudinal data management: a case study from Nepal
Subash Gajurel (Nepal Injury Research Center)
Felix Ritchie (University of the West of England, Bristol)
Dionysia Kordopati (University of the West of England, Bristol, UK)
Sunil Kumar Joshi (Nepal Injury Research Centre)
Julie Mytton (University of the West of England, Bristol, UK)
REDCap is a widely used platform supporting the entire research data lifecycle, from collection to deidentification, linkage, and analysis, within a single integrated system. In Nepal, road traffic injuries (RTIs) pose major public health and economic challenges, and efforts to address them are hindered by limited data availability. Within the SafeTrip Nepal road safety research programme (2022-2026), REDCap and Data Governance (DG) practices are being utilized to address these gaps and inform policy decisions. This poster demonstrates constructing longitudinal datasets for analysis from repeated collection points, managing data sensitivity levels, and aligning processing stages with DG plans to improve auditing, risk management, and research outcomes. A four-stage DG plan was implemented: Planning (REDCap installation, training, and role assignments); Data Collection (real-time data entry, monthly follow-ups, and quality control at two sites with four hospitals, two data collectors per hospital, and one supervisor per site); Analysis (using SPSS and Power BI for data analysis and visualisation); and Output (secure storage, deidentification, role-based access, and ethical compliance). Post-project data will be archived for five years and then securely destroyed. REDCap enabled efficient follow-up data management. Key strengths included robust quality control, confidentiality, secure data handling, and efficient role-based access. Implementing a DG framework further enhanced data practices by promoting standardisation, accountability, compliance, and risk management. Challenges such as maintaining consistent training and addressing technical issues were mitigated through regular technical support, iterative training sessions, and assistance from the REDCap community. This study demonstrates the feasibility and efficiency of managing RTI research data using REDCap and DG. The method can be expanded to address more social/public health issues and enhance data security, quality, and ethical compliance.
It provides a framework for improving data management procedures, assisting with well-informed decisions, and achieving high-quality and ethical research outcomes.
Trends in the use of academic repositories and social networks in promoting LIS and Communication scholars' papers
Nicoleta Roxana Dinu (National Library of Romania)
The aim of the poster is to analyse the habits of researchers and librarians when promoting their work through repositories and academic social networks in the fields of Information, Documentation and Communication. Repositories are also used for preservation and early publication (preprints) of journal articles. A bibliographic review, an online survey, interviews with experts, and an analysis of the evolution of some repositories have been carried out. Among the results: thematic repositories, which were widely used 20 years ago, are now facing competition from institutional repositories (IR) and academic social networks such as ResearchGate, Academia.edu and others. A continued decrease in the number of deposits in the e-LIS (Eprints in Library and Information Science) and MPRA (Munich Personal RePEc Archive) repositories has been observed in recent years, which appears to be because many universities have issued mandates requiring faculty to archive their work in their IR. Another important cause of the decrease in deposits is the greater availability of open access journals in recent years, which has made it less necessary to archive articles in repositories. Academic social networks, on the other hand, are perceived as more dynamic, allowing interaction with other authors and not requiring as much metadata to upload documents.
Research Collaboration to Model the Future of Freshwater Salinization
The increasing salinization of freshwater resources is a growing threat to ecosystems, particularly as urbanization rises and salts and deicers are widely applied on impervious surfaces such as roads, highways, parking lots, and driveways. Winter salting leads to higher chloride (Cl) levels in streams, rivers, and lakes, which is detrimental to the health and reproduction of many freshwater species. However, there is a lack of predictive models to understand how urbanization, climate change, and land management practices influence Cl concentrations in streams. Previous research has relied on proprietary models or has not accounted for important factors like the long-term retention of salt in soil and groundwater, which can have delayed effects on streams, especially in areas with mixed land use. Moreover, many existing models are calibrated using short-term data (often less than a year) and lack proper validation. This study aims to fill these gaps by using an integrated modeling approach to predict Cl concentrations in urban streams, considering varying land use, climate conditions, groundwater Cl contributions, and salting practices. Researchers from Toronto Metropolitan University (Ontario) collaborated with the DataSquad team at Carleton University (Minnesota), combining their expertise in environmental modeling and data science through a summer internship. This led to ongoing collaboration and co-authorship opportunities for the students assisting in the project. This poster will highlight the research context, collaboration process, student contributions, and key learnings from the experience.
Supporting Research Libraries with OpenAIRE Services: Providing Essential Assistance
Maja Dolinar (OpenAIRE AMKE)
OpenAIRE offers a comprehensive suite of advanced services aimed at augmenting the capabilities of research librarians by providing tools that enhance data management, accessibility, and analysis. This poster delves into how OpenAIRE services significantly benefit research librarians, empowering them to advocate for open science and elevate research outcomes. OpenAIRE services extend support throughout the entire research lifecycle, from planning to dissemination. A prime example is the Argos platform, which streamlines the creation and management of Data Management Plans (DMPs). Argos facilitates compliance with FAIR data principles, rendering DMPs actionable and interconnected with broader research outputs. This enables librarians to assist researchers in managing data effectively while ensuring long-term accessibility and adherence to funding mandates. Additionally, OpenAIRE CONNECT equips librarians with tools to build tailored research gateways, enhancing the discovery and dissemination of scholarly work. This service not only improves thematic discovery and boosts the visibility of research outputs but is also customizable to align with institutional branding, thereby reinforcing institutional identity. Furthermore, OpenAIRE MONITOR provides librarians with advanced analytics capabilities, enabling them to track and assess the impact of research activities across various dimensions. This supports evidence-based decision-making and strategic planning, which are crucial for the sustainability of research institutions. Moreover, OpenAIRE EXPLORE provides extensive access to a diverse array of scholarly outputs, aiding librarians in efficiently cataloging and distributing knowledge. Zenodo ensures the archiving of research data, maintaining its accessibility and citability. Meanwhile, OpenAPC promotes transparency in managing publication costs, facilitating informed financial decisions. 
In essence, OpenAIRE enhances librarians' roles as knowledge stewards by offering innovative tools for managing, supporting, and disseminating open science practices. By aligning with open science initiatives, these services not only bolster institutional research capacities but also underscore the vital role of research librarians in promoting transparency, reproducibility, and broader dissemination of scholarly work.
Wolfgang Zenk-Möltgen (GESIS - Leibniz Institute for the Social Sciences)
Hilde Orten (SIKT)
Darren Bell (UKDS)
Making data FAIR, i.e. findable, accessible, interoperable, and re-usable, has recently become an important goal for research data documentation. DDI metadata can support FAIR documentation of research data along the complete research data lifecycle, and specific products of the DDI Alliance help to achieve the FAIR principles. For example, by supporting persistent identifiers and documenting conceptual components, research data becomes more findable. By documenting data products in DDI on conceptual as well as physical levels, in the case of DDI-CDI including the datums themselves, data is more accessible to users. From the beginning, DDI has taken the approach of being readable by both humans and machines, which makes data and metadata interoperable. Using standardized re-usable components such as code lists, controlled vocabularies, and questions and variables from the DDI product suite, re-usability is greatly improved. The poster will highlight how DDI standards and specific products help to improve the FAIRness of research data and thus contribute to open science and better research data management.
Unlocking the potential of standardised scales through metadata
Claudia Alioto (CLOSER The home of longitudinal research)
Rebecca Oldroyd (CLOSER The home of longitudinal research)
Jon Johnson (CLOSER The home of longitudinal research)
Standardised scales, also known as summated scales or validated questionnaires, are a group of related questions that measure an underlying concept. These scales are valuable research tools as they are cost- and time-effective to implement, and allow researchers to reliably measure concepts across samples and over time. The use of standardised scales also enhances the comparability of research data. However, finding information (i.e. metadata) about standardised scales is challenging and time-consuming. Information is typically scattered across multiple sources and documents, if available at all, and permission is sometimes required to access the scale. Additionally, some scales have multiple versions that include a subset of the original items, and it can be difficult to trace these versions back to the original. It is also difficult to find information about where these scales have been used in existing research. CLOSER aims to address these challenges by making information on multiple standardised scales openly accessible to researchers. We have gathered and documented up-to-date, comprehensive metadata about the scales used in the CLOSER Discovery studies in one centralised, publicly available platform: CLOSER Discovery. Users can now find information about the name, citation, question items (both the original and other versions), topics measured (e.g. alcohol consumption, physical health, depression), and their usage in the CLOSER Discovery study questionnaires and datasets for 10 standardised scales. We are preparing to add metadata for an additional 10 scales to CLOSER Discovery in early 2025. To our knowledge, CLOSER Discovery is the only platform providing such detailed metadata on standardised scales, enabling researchers to identify where scales are used both within and across studies. This poster will describe the process of creating comprehensive standardised scale metadata, the benefits of this metadata for researchers, and our future plans.
How to Engage the Community in Citing Data? CESSDA Data Citation Guide
Christina Bornatici (Swiss Centre of Expertise in the Social Sciences (FORS))
Tuomas J. Alaterä (Finnish Social Science Data Archive (FSD))
Dimitra Kondyli (So.Da.Net)
Farah Karim (GESIS - Leibniz Institute for the Social Sciences)
This poster complements the presentation on CESSDA Data Citation Recommendations, a comprehensive set of best practices which aim to foster a sustainable data citation culture. In the poster, we will highlight the recommendations, particularly those targeting data repositories and those most effective in enhancing the community's role in making data citations more visible and impactful. The recommendations are designed to be practical and concrete, showcasing best practices and technical implementations for community stakeholders. We will engage in discussions with the visitors to our poster and gather community feedback to further refine the citation recommendations.
Harnessing AI to Elevate Machine-Actionable Data Management Plans (maDMPs)
Markus Koskela (CSC - IT Centre for Science)
Johanna Laiho-Kauranne (CSC - IT Centre for Science)
Jukka Rantasaari (University of Turku)
Juuso Repo (University of Turku)
This demonstration explores the transformative potential of artificial intelligence (AI) in advancing data management planning and the development of machine-actionable data management plans (maDMPs). We highlight how AI can revolutionize research data management by improving processes like data preparation, model training, and human oversight. Our primary targets are to understand the transformative role of AI in data management and to critically evaluate the extent of required human oversight. Key areas of focus include: developing AI tools tailored to specific roles such as researchers, data stewards, and funders; using methodologies such as data analytics, deep learning, and large language models to enhance DMP prototypes; and addressing contemporary topics such as digital objects, ontologies, and data spaces for academic and commercial research contexts. Examples of our use cases include dynamically drafting and updating DMPs throughout the research lifecycle, ensuring compliance with reproducibility standards, and evaluating DMPs against selected performance criteria. Other scenarios include supporting ethical reviews, generating data availability statements, and deploying coaching tools for reproducibility and data stewardship. We will also assess the importance of human oversight, and how the limitations of AI impact data management practices. We offer insights into the future of AI-enhanced research workflows and show how innovative approaches to AI-driven data management solutions can be applied.
Reproducibility and future challenges: Labour market inequalities using UK Census data
Placide Abasabanye (University of Manchester)
The UK Census has long been a cornerstone of social science research, providing a standardised and comprehensive dataset for the analysis of societal issues such as labour market inequalities. This poster explores the reproducibility of a study of employment disparities using the 2021 UK Census, focusing on differences by gender, ethnicity, immigration status and geographical region. Adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, this analysis uses robust methodologies and open tools, ensuring that the results can be replicated and extended by researchers and policy makers. The main findings reveal significant disparities in employment rates, income levels and occupational representation, highlighting the systemic challenges to achieving equitable labour market outcomes. Going forward, the UK's potential transition from a traditional census to a reliance on local administrative data poses significant challenges in terms of reproducibility and consistency. Administrative data sources often vary in terms of definitions, coverage and accessibility, raising concerns about their granularity and comparability over time and between regions. This is likely to exacerbate data gaps and biases, particularly for under-represented groups, as well as complicating the replication of studies and the formulation of evidence-based policies. This poster highlights the essential role of standardised national datasets, such as the UK Census, in promoting reproducibility and comparability. It describes the future challenges associated with the transition to administrative data and proposes strategies to mitigate these risks, including investment in data standardisation, metadata development and integrated data infrastructures. In addressing these issues, the research highlights the importance of maintaining robust and accessible data sets to support informed decision-making and equitable social policies.
Creative Speculations for Building Sustainable Support for Qualitative Research and Pedagogy
Di Yoong (Carnegie Mellon University Libraries)
Jessica Benner (Carnegie Mellon University Libraries)
At our institution, support for qualitative research and pedagogy has been limited and inconsistent, which echoes the findings of Swygart-Hobaugh (2016), now almost ten years later. As we hope to build out more consistent and sustainable services and programming, much like Castello, Kellam, and Tran (2024), we have been conducting a qualitative research study to identify researchers, practitioners, and instructors at our university to better understand common workflows, tools, and approaches. To better support discussions on metadata and open knowledge for qualitative data at our institution, we are interested in understanding data storage practices and avenues of publications, as well as how and what data sources are used in teaching. As we share our results and findings, we would love to invite fellow attendees to engage in some creative speculation with us to expand our services and programs to support researchers, practitioners, and instructors who are coming from varied disciplines. We aim to collate our findings and discussions to design programming that will help create better connections and collaborations across campus with folx who are interested in and/or have been doing qualitative work.
Reimagining Catalogs and Repositories to Enhance the FAIRness of Research Data
John Marcotte (University of Michigan)
Sarah Rush (University of Michigan)
Kelly Ogden-Schuette (University of Michigan)
Research data archives typically comprise catalogs and repositories that are tightly woven together, in which research data are both discovered and accessed through the same platform. While this archive design enables the catalog to reflect the repository in substantial detail, it can limit discoverability and interoperability. Furthermore, studies can contain diverse types of data, such as surveys, biomarkers, and neuroimages, that may be available through different specialized repositories, each often with its own metadata schema. We propose to loosen the tight coupling of a catalog to a single repository by having catalogs incorporate study metadata from many different repositories. Catalogs will enable data discovery while repositories will provide data access: repositories host and deliver access to the research data, while catalogs provide searchable listings of their contents. As a result, the FAIRness of research data will be enhanced. Our proposed archive design requires an API (application programming interface) for each repository, so that catalogs can communicate with repositories to produce searchable listings. Catalogs typically incorporate metadata from their own repositories; in our reformulation, catalogs would pull metadata from multiple repositories. For example, in the new design, a catalog might connect with three repositories: "Repository 1" could be Dataverse; "Repository 2" could be Colectica for longitudinal studies; and "Repository 3" could be dbGaP. Other types of repositories host videos, neuroimages, characteristics of geographic areas, and qualitative data. This design would enable researchers to discover different types of data that are related. The new design will also enhance interoperability, as repositories will need to provide a standards-based API for communication with different catalogs. We will present how our formulation enhances the findability and interoperability of research data.
Moreover, we show how to include different metadata schema into one catalog
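The core of the proposed design, a catalog that pulls study metadata from heterogeneous repository APIs and normalizes it into one searchable listing, can be sketched as follows. The repository record shapes, field names, and common schema below are illustrative assumptions, not the authors' implementation:

```python
# Sketch: a catalog merging study metadata from multiple repositories.
# Each repository returns records in its own schema; a per-repository
# mapper normalizes them into one common catalog schema.

def from_dataverse(record):
    # Hypothetical Dataverse-style fields
    return {"title": record["name"], "id": record["global_id"], "source": "Dataverse"}

def from_dbgap(record):
    # Hypothetical dbGaP-style fields
    return {"title": record["study_name"], "id": record["phs_accession"], "source": "dbGaP"}

def build_catalog(repositories):
    """repositories: list of (mapper, records) pairs fetched from each API."""
    catalog = []
    for mapper, records in repositories:
        catalog.extend(mapper(r) for r in records)
    return catalog

catalog = build_catalog([
    (from_dataverse, [{"name": "Survey of Households", "global_id": "doi:10.123/abc"}]),
    (from_dbgap, [{"study_name": "Genomic Cohort", "phs_accession": "phs000001"}]),
])
# Both studies, held in different repositories with different schemas,
# are now discoverable through a single searchable listing.
```

The design choice this illustrates is that the catalog owns only the common schema and the mappers; data access itself stays with each repository.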
Enhancing Research Data Usability through Turbocurator: An AI-Driven Tool for Creating FAIR Metadata
Margaret Levenstein (University of Michigan, Institute for Social Research, ICPSR)
Jeannette Jackson (Institute for Social Research, ICPSR)
ICPSR, one of the world's largest social science data archives, is located at the Institute for Social Research at the University of Michigan. ICPSR has pioneered an AI-driven tool called TurboCurator, designed in collaboration with Harvard's Dataverse, to promote FAIR data principles by providing an easy-to-use tool for improving the quality of metadata. Often, data depositors provide insufficient metadata, making the data harder to reuse. This poster showcases the development and application of TurboCurator, a tool that assists data depositors by generating suggestions for Title, Description, and Keywords, identified as key metadata components through user research with data depositors and data curators at data archives around the world. Furthermore, the poster demonstrates how incorporating worldwide social science metadata standards in post-processing checks provides quality assurance for the suggested metadata. TurboCurator quickly generates recommendations, but ultimately, the data depositor decides which metadata to use. By streamlining the data submission process, TurboCurator enhances the overall quality and usability of research data, thereby supporting open science.
Advancing Social Science Collaboration through Open Science: The Czech Republic's EOSC Journey
Ilona Trtíková (Institute of Sociology of the Czech Academy of Sciences)
Martin Vávra (Institute of Sociology of the Czech Academy of Sciences)
Jindřich Krejčí (Institute of Sociology of the Czech Academy of Sciences)
Tomáš Čížek (Institute of Sociology of the Czech Academy of Sciences)
Yana Leontiyeva (Institute of Sociology of the Czech Academy of Sciences)
The Czech Republic is currently establishing a national node as part of the European Open Science Cloud (EOSC) initiative. This node aims to promote best practices in research data management across scientific communities on a federated basis. A key objective of the EOSC implementation in the Czech Republic is to develop a National Data Infrastructure (NDI), a shared platform for data sharing, management, and access to computational resources for research purposes. The NDI implementation is designed to support scientific and multidisciplinary research activities across a wide range of disciplines, including the social sciences. Within this scope, the new Open Science II project plays a pivotal role by focusing on the social sciences as one of its target areas. What is particularly innovative is the collaboration it fosters among fields such as sociology, social geography, and economics—disciplines that traditionally operate independently. The project addresses challenges related to new data types, including synthetic data, AI-generated data, data from maps and paradata, while exploring their potential to advance social science research. The poster will present the key themes tackled within the project, highlight the collaborative efforts among social science disciplines, and identify specific challenges that arise in this context.
Implementing PID Policy in Dataverse: Challenges and Opportunities
Vaidas Morkevicius (Kaunas University of Technology)
The Dataverse repository software is one of the most developed platforms for data curation available. Importantly, it strives to provide a platform that allows curating data in the best possible compliance with the FAIR guiding principles and the TRUST principles for digital repositories. However, different aspects of these principles require complex implementations that are not always easily achieved. One important principle related to data discoverability is the issuing of persistent identifiers (PIDs) for data objects (data/documentation files, datasets, data collections, etc.). In its current version (6.4), Dataverse allows issuing globally unique persistent identifiers for datasets and for data/documentation files within datasets. Organizations that employ Dataverse software for data curation in their repositories are dependent on the software when formulating their PID-issuing policy. In this poster we will demonstrate the PID-issuing opportunities and limitations provided by the Dataverse software, and how these may affect the options available to data-curating organizations that are developing (or plan to develop) a PID-issuing policy for their data objects. The poster will also hint at possible future developments needed for the Dataverse software to make it an even more flexible platform offering diverse solutions for PID issuing.
The OSF Data Detectives! A Game of Persistent Identifiers
Crystal Steltenpohl (Center for Open Science)
As the OSF Data Detective, your goal is to explore the OSF to track down an important piece of research, where it was created, and, of course, whodunit! The OSF Data Detective is played on a board representing the OSF with various communities that are utilizing the OSF infrastructure (like rooms in Clue). Each community is dedicated to a different field of study or initiative. Players take on the roles of data detectives, navigating through the OSF to collect clues. These clues are represented by ORCID identifiers for researchers, Digital Object Identifiers (DOIs) for pieces of research, and Research Organization Registry (ROR) for institutions. The goal is to determine which specific research (DOI) we need to cite, which researcher (ORCID) created it, and from which institution (ROR) it originated. Components: - Game Board: Illustrates the OSF communities. - Cards: Include ORCID IDs, DOIs, and RORs. - Player Pieces: Different data detectives. - Detective Notes: For tracking clues. - Envelope: Holds the secret solution cards. Setup: Secretly select a DOI, ORCID, and ROR card, placing them in the envelope. Shuffle and distribute remaining cards among players, with extras in the Research Commons. Players select their piece and start at the conference entrance. Gameplay: Roll a die to move and enter communities, making hypotheses about the research, researcher, and institution. Upon entering a community, suggest a DOI, ORCID, and ROR combination. Other players reveal if they have any cards from the suggestion. Use Detective Notes to track revealed information. Forming a Citation and Winning: Players who think they've solved the case form a citation, stating their solution and checking the envelope. Correct solutions win the game; incorrect guesses end the player's ability to suggest but not to participate in revealing cards. The winner successfully identifies the correct research, researcher, and institution.
Charting New Horizons: Advancing Data Consultation and Visualization at Clemson Library for a Connected Future
Stacie Powell (Clemson University)
This poster session, "Charting New Horizons: Advancing Data Consultation and Visualization at Clemson Library for a Connected Future," will highlight how Clemson Library has relaunched its data services to better serve its diverse user base. The initiative aimed to refine and expand data consultation and visualization services, making them more accessible, impactful, and effective. This effort was driven by the need to modernize and broaden the scope of the library's previously limited and outdated data services, which could no longer meet the growing demands of users. The project involved a comprehensive redesign and implementation process, which included assessing user needs, launching robust rebranding and marketing campaigns, developing new service models, and hiring a new team of Graduate Assistants to serve as data visualization experts and contribute to the consultation process. Key achievements included increased user engagement, higher patron satisfaction, and a more streamlined data consultation process. Despite challenges such as resource constraints and marketing setbacks, these obstacles were mitigated through collaboration with various University partners. We worked closely with several campus departments to offer engaging workshops on data visualization topics, which were well-attended and received. Additionally, we partnered with the Graduate School, which provided funding in exchange for workshops tailored to graduate students and data services professional development. The revamped services have significantly impacted both the library and its users, as demonstrated by quantitative and qualitative data reflecting enhanced user experiences and greater service utilization. Looking ahead, future plans include expanding the range of services offered, hiring a full-time Data Visualization Specialist, and continuing to innovate in data consultation and visualization. 
The long-term goal is to establish Clemson Library as a campus leader in data visualization services, fostering a connected and informed community.
Brick by Instructional Brick: Using Templates as a Scaffold to Teach Research Data Management
Lauren Phegley (University of Pennsylvania)
A common frustration with teaching data management to researchers in one-shot sessions is not knowing whether the lesson created longer-lasting impacts. One method for supporting changes in data management behaviors is to provide templates that participants can use to implement the learning objectives after the session. Templates for topics like file naming, folder organization, and documentation offer a scaffolded learning experience while also being directly implementable in the researcher's workflow. This poster will introduce templates as an instructional scaffold, describe how they have been adopted in the author's instruction sessions, and share success stories of how they have been used outside of instruction. A reference list of open access templates that can be used in teaching or shared independently will be provided on the poster.
Advancing FAIR + CARE Practices in Cultural Heritage
Regina Roberts (Stanford University Libraries)
This poster will display information about the FAIR + CARE Network (an IMLS grant project) and will cover: developing digital data governance models; examples of effective relationship building and collaborative practices for stewarding cultural heritage collections; and weaving CARE practices into a Data Management Plan. Participants will be invited to respond to questions about praxis in their own work, helping the network develop tools for the ethical use of CARE principles in support of FAIR cultural heritage data. This interactive portion of the poster will gather input from participants about their own FAIR + CARE practices.
Creating a Bibliography of Publications Analyzing or Discussing Library-Licensed Data
Kate Barron (Stanford University Libraries)
Among their many charges, academic libraries are responsible for developing collections that support institutional teaching and research. New and ongoing acquisitions must occur within the library's budget and operational parameters. Data are particularly costly acquisitions because they (1) demand high and/or ongoing license fees; (2) must be hosted on specialized computational infrastructure; and (3) must be curated and/or managed by specialized staff. Given these high costs, how can academic librarians assess the value of licensed data collections and justify their purchase? Because licensed data will generally have high cost-per-use, it may be more constructive to measure value using the volume and impact of research projects employing the data. This poster will describe Stanford Libraries' efforts to inventory scholarly literature that was (1) authored or edited by then Stanford-affiliated researchers; and (2) contains analysis or discussion of library-licensed data. These efforts are modeled on the ICPSR Bibliography of Data-related Literature: Originating Methodology (https://www.icpsr.umich.edu/web/pages/ICPSR/citations/about.html). The poster will report the (1) purpose, goals and anticipated outcomes of the bibliography project; (2) project methodology, including publication inclusion criteria, search strategies and citation management; and (3) preliminary analysis of the compiled bibliography and how it does or does not demonstrate the value of licensed data collections.
Developing Collaborative Data Services and Instruction
Gabriella Evergreen (Cornell University)
Lencia McKee (Cornell University)
Research Data and Open Scholarship is a centralized library service at Cornell University that is charged with facilitating ethical stewardship and sharing of research and scholarship. While the formation of this group is relatively new, librarians at Cornell have been providing and developing data services to the research community for over 15 years. What began as basic data management planning has transformed into comprehensive services that encompass not only data planning and storage, but also sharing, long-term preservation, and the widespread adoption of persistent identifiers like ORCIDs and DOIs. These advancements have not only facilitated the creation of FAIR data (findable, accessible, interoperable, and reusable) but have also played a crucial role in enhancing research reproducibility and collaboration across disciplines. In the current research landscape, with increased awareness and adoption of data sharing and open scholarship practices, monitoring and responding to data management needs is more important than ever. Providing data management services not only helps researchers comply with funding agency and publisher requirements but also enhances the reproducibility and impact of their research. While we continue to support the scientific community, we also seek to extend our reach to non-scientific research communities. As data sharing mandates become more common in other fields (e.g., the new NEH Public Access Policy), the need for instruction and support around good data management practices in all fields of study is crucial. We are exploring ways to broaden our reach and develop targeted services for researchers in the interpretive social sciences and humanities, as well as early-career scholars who are producing nontraditional research outputs, and researchers who may not consider themselves as working with "data." Our outreach and instruction plan will require building partnerships with research and learning services librarians across the university, and perhaps changing the way we talk and teach about data.
Identifier Detective: Using AI to find information on Database Identifiers
Inku Subedi (Harvard Business School)
As business librarians, we often get questions like: "What is an identifier – is it unique to companies?"; "What identifiers are in Capital IQ?"; "How is CUSIP different from CIK?"; "Can you help me match my dataset to get industry NAICS codes?". I created a chatbot called "Identifier Detective" that helps users answer these questions interactively. To create this chatbot, I use a spreadsheet that matches different financial databases, such as Capital IQ, Bloomberg, and LSEG Workspace, with corresponding identifiers such as CUSIP, GVKEY, ISIN, SEDOL, CIK, DUNS Number, and Ticker. Users can customize their questions with the chatbot. The chatbot creates a folder with the user's name and saves their answers in that folder. If the user wants to find which databases have GVKEY and CIK, the chatbot will use the spreadsheet to provide the answer: Capital IQ. It also allows users to upload their own dataset into their folder. For example, if a user has data from Compustat and wants to know how to match the companies to get CIKs, the chatbot will tell them their identifier is from Compustat and that they can use GVKEY to match with Capital IQ and get CIKs for the companies. The chatbot also provides more information on the identifiers: for example, if a user asks about CIK, it provides curated information about CIK from the spreadsheet. In this presentation, I will show how to create such a chatbot, give a live demo, and discuss the limitations and lessons learned when creating it.
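The core lookup the chatbot performs, finding which databases contain a given set of identifiers, can be sketched like this. The small database-to-identifier table below is an illustrative assumption standing in for the presenter's full spreadsheet:

```python
# Sketch: which databases contain all of a given set of identifiers?
# This toy mapping stands in for the spreadsheet described in the abstract.
DATABASE_IDENTIFIERS = {
    "Capital IQ": {"CUSIP", "GVKEY", "ISIN", "CIK", "Ticker"},
    "Bloomberg": {"ISIN", "SEDOL", "Ticker"},
    "Compustat": {"GVKEY", "CUSIP", "Ticker"},
}

def databases_with(identifiers):
    """Return the databases whose identifier set covers all requested identifiers."""
    wanted = set(identifiers)
    return sorted(db for db, ids in DATABASE_IDENTIFIERS.items() if wanted <= ids)

print(databases_with(["GVKEY", "CIK"]))  # with this toy table: ['Capital IQ']
```

In the real tool, this table would be loaded from the spreadsheet and the result phrased conversationally; the set-containment logic is the same.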
A naïve method to treat a series of dichotomous variables as a single number
Flavio Bonifacio (Metis Ricerche)
There are many ways to treat a set of dichotomous variables; factor analysis, to quote one of the most used methods, is one of them. The goal is always dimension reduction. For example, a battery of ten dichotomous items may be reduced to a few variables, according to some numerical benchmark (eigenvalues, for example). This is done to facilitate comprehension of the underlying phenomenon and to simplify the use of analytic tools (contingency tables, regression, structural equation models, and so on). In this exercise I try to show a way to analyse the statistical relations between the set of dichotomous variables (as a dependent variable) and a set of independent features. The dependent set of dichotomous variables will be treated as a single numeric variable without loss of precision; the poster will explain how. It must be noted that this is only experimental work, and in many cases the results are not yet satisfactory. Nevertheless, they may serve to show a way forward.
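The abstract does not specify the encoding, so as a purely illustrative assumption (not necessarily the poster's method): one natural way to collapse a battery of 0/1 items into a single number without loss of precision is to read the items as the bits of a binary integer, which is exactly invertible:

```python
# Sketch: encode ten 0/1 items as one integer, and decode it back losslessly.
def encode(items):
    """items: list of 0/1 values; returns a single integer."""
    value = 0
    for bit in items:
        value = value * 2 + bit
    return value

def decode(value, n_items):
    """Invert encode(): recover the original 0/1 list."""
    items = []
    for _ in range(n_items):
        items.append(value % 2)
        value //= 2
    return items[::-1]

items = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
code = encode(items)              # one number carrying all ten answers
assert decode(code, 10) == items  # no information is lost
```

With ten items the encoded value ranges from 0 to 1023, so every distinct response pattern maps to a distinct number, which is what "without loss of precision" requires.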
June 3, 2025: Pre-Conference Workshops
Session theme/info: All workshops were held at UWE's Frenchay campus.
Bridging Design and Accessibility: Creating FAIR and Inclusive Visualizations
Sarah Siddiqui (University of Rochester)
Heather Charlotte Owen (University of Rochester)
This interactive workshop is aimed at making visualizations more appealing, accessible, and FAIR. Participants will explore strategies to determine the accessibility of a graph, learn tips for creating accessible graphs, and create their own visualizations with various tools. Beginning with the theory of design and then diving into individual elements with hands-on exercises, participants will experience the full data visualization cycle and practice creating accessible graphs using software such as Excel, Tableau, RStudio, and Jupyter Notebooks. While all of these tools are open-source and/or freely available (except Excel, which is fairly ubiquitous), they differ significantly in how they visualize graphs, requiring distinct approaches to accessibility. In addition, participants will learn how to create accessible tables and documentation to ensure a fully FAIR package that can be understood and built upon by the public. As data sharing becomes increasingly prevalent, it is imperative that data professionals become knowledgeable about how visualizations can be shared in a way that is FAIR but also accessible to individuals with disabilities. Raw data and code receive the most attention when it comes to ensuring research is FAIR, but visualizations should receive the same treatment. While data and code increase the reproducibility of research, it is the visualizations that make it understandable to the general public. This workshop will help bridge this gap and encourage data professionals to expand the range of research outputs they assist researchers in producing.
AI-enabled data practices for metadata discovery and access: Best practices for developing training data
Wing Yan Li (University of Surrey)
Chandresh Pravin (University of Surrey)
Continued investment in new and existing data collection infrastructures (such as surveys and smart data) highlights the growing need for the creation of efficient, robust and scalable data resources that help researchers find and access data. Recent advances in artificial intelligence (AI) methods that facilitate automatic analysis of large text collections provide a unique opportunity, at the intersection of computational techniques and research methodologies, for the development of data resources that are able to meet the current and future needs of the research community. With the widening application of AI and machine learning (ML) pipelines for processing large text corpora, this workshop focuses on a fundamental prerequisite before setting up any pipeline for downstream tasks: the dataset. It is a common perception that ML models are data hungry and require vast amounts of data to enhance model performance. While understandable, this perception can sometimes overshadow the importance of data quality. In collaboration with CLOSER, this workshop will cover a typical "packaging" of data to train and evaluate models. The workshop will explore various aspects that contribute towards good practice for creating quality training datasets, including exploratory data analysis, selection of evaluation metrics, model selection and model evaluation. Conventionally, models are evaluated both quantitatively, as represented by appropriate metrics, and qualitatively. While it might be tedious to qualitatively analyse all the samples, random sampling could be problematic. In the section covering model evaluation, workshop participants will be introduced to the problem of data biases and gaps. By bridging technological approaches with social science research needs, this workshop offers an exploration of data transformation techniques that enhance research reproducibility and computational analysis capabilities.
Learn how to create synthetic data
Gillian Raab (University of Edinburgh)
Lynne Adair (University of Edinburgh)
The workshop will discuss the use of synthetic data for disclosure control, including discussion of what synthetic data are and how they can be used. Different types of synthetic data will be discussed. A practical session on how to create low-fidelity synthetic data using the R package synthpop will be included, and participants will be guided as to how they might proceed to create synthetic data with greater utility. The workshop will also include discussion of how to assess synthetic data for utility and disclosure control. We will also review examples of cases where synthetic data have been released to the public. If a full day were available, we could also run an afternoon session on how to create high-fidelity synthetic data.
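The workshop itself uses the R package synthpop. As a language-agnostic illustration of what "low-fidelity" synthesis means (a toy sketch under my own assumptions, not synthpop's algorithm), the code below generates synthetic records by sampling each variable independently from its observed values, so univariate distributions look plausible while relationships between variables, and links to real individuals, are broken:

```python
import random

# Toy low-fidelity synthesis: sample each column independently from its
# observed values, so marginals are plausible but joint structure is lost.
def synthesize(records, n, seed=0):
    rng = random.Random(seed)
    columns = {key: [r[key] for r in records] for key in records[0]}
    return [{key: rng.choice(values) for key, values in columns.items()}
            for _ in range(n)]

real = [
    {"age_group": "16-34", "employed": "yes"},
    {"age_group": "35-54", "employed": "yes"},
    {"age_group": "55+", "employed": "no"},
]
fake = synthesize(real, 100)
# Every synthetic value occurs somewhere in the real data, but rows are
# recombined at random, which is what provides (partial) disclosure control.
```

Higher-fidelity methods, such as synthpop's default of modelling each variable conditionally on those already synthesized, preserve more of the joint structure, which is exactly the utility/disclosure trade-off the workshop covers.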
Teaching Qualitative Data Analysis using Open Data, Standards, and Tools
Sebastian Karcher (Qualitative Data Repository)
Nathaniel Porter (Virginia Tech)
Michael Beckstrand (University of Minnesota)
Teaching Computer-Assisted Qualitative Data Analysis (CAQDAS) can be a challenge: qualitative data for instruction can be hard to find, and CAQDAS data formats are proprietary, as are all the leading tools. In this "train the trainer" workshop, we show how instructors can leverage the recent opening of qualitative research infrastructure and build effective training around open qualitative data (as can be found in the Qualitative Data Repository), open standards (here, the REFI-QDA standard for the exchange of qualitative data projects), and open tools (the open-source QualCoder software). The workshop follows the structure of an "Introduction to Computer-Assisted Qualitative Data Analysis" workshop, but focuses on the technical background and pedagogical issues that instructors face when teaching such a course. We begin with an introduction to the Qualitative Data Repository, the REFI-QDA format, and the QualCoder software. We discuss identifying qualitative data for teaching a workshop as well as common challenges in setting up and working with the QualCoder tool. We then jointly engage in an abbreviated version of a set of qualitative data coding exercises that highlights good practices in teaching the coding of qualitative data. The workshop concludes with some considerations for sharing coded qualitative data. The target audience for the workshop is principally data professionals with some familiarity with qualitative research who are considering providing support for qualitative researchers. That said, the workshop does not have any prerequisites and is also open to adventurous beginners interested in learning about qualitative data analysis and its developing open infrastructure.
The Art of Transcription: Using open-source tools to optimize transcription processes
Maureen Haaker (UK Data Service)
One of the most significant challenges when working with qualitative data is the substantial amount of time and resources required to prepare text for analysis and sharing. A key step in this preparation is the process of converting audio files into text, or transcription. This task involves multiple elements, such as speaker segmentation, tagging, editing, anonymization, and notation. Each of these steps is essential for producing accurate, high-quality transcripts, but can also create substantial barriers to the efficient curation of qualitative data. In short, the effort involved in completing a full and accurate transcription can often slow down the overall workflow and limit the accessibility of data. However, despite these challenges, complete transcriptions are essential for ensuring data privacy, enabling secure data sharing, and maximizing the reuse and value of qualitative collections. When done well, transcription makes it possible to share sensitive research data in a way that respects confidentiality and privacy while making the data more accessible to other researchers. As such, transcription is just as vital for the curation and long-term management of qualitative data as it is for research analysis itself. This workshop is designed for individuals working with both audio and text data who are seeking solutions to streamline their transcription workflows. It will focus on exploring how open-source tools can ease the burden of transcription, notation, summarization, and anonymization. Participants will learn how to build a semi-automated curation pipeline that efficiently converts audio files into shareable, anonymized transcripts. They will also have the opportunity to discuss how these tools can be integrated into existing research workflows, helping to improve the efficiency and quality of transcriptions. 
By leveraging these powerful open-source tools, researchers can optimize their transcription process, reduce the workload involved, and critically evaluate the choices they make when handling qualitative data.
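One small piece of such a semi-automated pipeline, offered here as an illustrative assumption rather than the workshop's actual materials, is rule-based pseudonymisation of a finished transcript: replacing known names with stable pseudonyms so speakers remain distinguishable but not identifiable.

```python
import re

# Toy pseudonymisation step for a transcript: replace each known name with
# a stable pseudonym, returning both the text and the mapping used.
def pseudonymise(text, names):
    mapping = {name: f"[Person {i + 1}]" for i, name in enumerate(names)}
    for name, pseudonym in mapping.items():
        text = re.sub(rf"\b{re.escape(name)}\b", pseudonym, text)
    return text, mapping

transcript = "Interviewer: Maureen, tell me about Bristol. Maureen: I moved here in 1998."
anonymised, mapping = pseudonymise(transcript, ["Maureen"])
# anonymised == "Interviewer: [Person 1], tell me about Bristol. [Person 1]: I moved here in 1998."
```

In practice the name list would come from speaker tags or a named-entity pass, and the mapping would be stored securely so anonymisation decisions remain auditable.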
Embedding sensitive data management best practice in institutional workflows
Zosia Beckles (University of Bristol)
Kirsty Merrett (University of Bristol)
Alice Motes (University of Bath)
Christopher Tibbs (University of Exeter)
Kellie Snow (Cardiff University)
This half-day workshop will provide an overview of approaches to management of sensitive research data based on the collective experience of GW4 (https://gw4.ac.uk/) research data management support staff at the University of Bath, University of Bristol, Cardiff University, and University of Exeter, and how these approaches may be adapted for different institutional contexts. This will include design of consent forms, data management plans, and data sharing strategies, and how to embed this within the context of individual institutional workflows and policy frameworks. The workshop is structured as a train-the-trainer package, developed from material previously delivered as part of the UKRN Train-the-Trainer programme (https://www.ukrn.org/training/data-sharing-from-planning-to-publishing-8-may/). It will include both subject matter content and workshop/training delivery design. In the first part of the workshop we will deliver current GW4 best practice on design of consent forms to enable data sharing, data management planning for studies involving sensitive data, and strategies to enable sharing of sensitive data. Following this, there will be discussion of pedagogic approaches to delivering training on these topics. The last part of the workshop will be devoted to developing content based on the topics covered in the earlier parts – there will be space for attendees to develop their own training materials tailored to their own institutional contexts. Attendees will be able to design and test content blocks and receive feedback from peers and workshop organisers.
An Introduction to Census Data with R Using the "tidycensus" Package
Aditya Ranganath (University of Colorado Boulder)
Data Librarians and social science information professionals must often use Census data, whether it is for their own projects, or in order to assist researchers or students with their empirical projects. The process of extracting, querying, analyzing, and visualizing these data can be complex, however; as a result, academic libraries often subscribe to commercial software products that make this process more straightforward. However, these solutions are expensive, and may not be interoperable with the tools and platforms that researchers typically use to conduct empirical research. In recent years, robust open-source software packages to work with Census data (using the Census Bureau's API) have been developed; these packages provide efficient, cost-effective, and user-friendly pathways to exploring Census data. This workshop provides a hands-on introduction to one such package, namely, the "tidycensus" package, which allows users to interact with the Census API using the R programming language. Participants will learn how to identify, locate, and extract data from the Census (and associated datasets) using "tidycensus", and subsequently analyze these data using tools from the R "tidyverse." Completing the workshop will empower participants to use "tidycensus" in their own research, as well as in their consultations with students and researchers who must work with Census data in their projects. Please note that while our focus will be on the United States Census, tools to extract and work with Census and demographic data from other countries using the R programming language will also be introduced.
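tidycensus is an R package, but under the hood it queries the Census Bureau's public API on the user's behalf. As a rough illustration of the kind of request involved (the variable code and geography below are examples, not workshop content), the query URL for county-level total population from the ACS 5-year estimates can be built like this:

```python
# Build a Census API query URL for total population (ACS variable B01003_001E)
# for every county in California (state FIPS code 06). Packages such as
# tidycensus construct and issue requests of this shape for the user.
base = "https://api.census.gov/data/2022/acs/acs5"
variables = "NAME,B01003_001E"      # county name + total population estimate
geography = "for=county:*&in=state:06"
url = f"{base}?get={variables}&{geography}"
print(url)
# https://api.census.gov/data/2022/acs/acs5?get=NAME,B01003_001E&for=county:*&in=state:06
```

The response is a JSON array of rows; what tidycensus adds on top is variable lookup, tidy data frames, and optional geometry for mapping, which is why it pairs so naturally with the tidyverse tools the workshop covers.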