IASSIST 2025: IASSIST at 50! Bridging oceans, harbouring data & anchoring the future


Sharing is Caring (about Research): Addressing Challenges in Sharing Protected Text Data Collections through Non-Consumptive Research

Computational social scientists increasingly rely on "found" data (such as videos, images, and text) rather than "designed" data generated through traditional experiments and surveys. This shift offers advantages: found data derived from online behavioral traces can mitigate problems such as limited ecological validity, social desirability bias, and recall bias. However, using found data also presents significant challenges, particularly regarding data ownership and sharing. While collection and analysis of online data often fall under fair use principles, these exceptions typically do not extend to making the data available for follow-up research or reproduction. This creates a fundamental tension between adhering to legal, privacy, and ethical standards and upholding the principles of transparent, reproducible research.

In this contribution, we address the legal, ethical, and technical challenges of sharing text data collections by proposing three complementary strategies:

1. Distributing pre-processed text versions that prevent reconstruction of the original content, thereby protecting data owners' interests
2. Sharing metadata that enables data reconstruction if the original data is still available online
3. Making non-consumptive research capabilities available that allow comprehensive data analysis without directly consuming (i.e., reading) the text
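As an illustration of the first strategy, one common form of pre-processing is to reduce each document to an alphabetically sorted bag of words. The following is a minimal sketch under stated assumptions, not the authors' tooling: the function name `to_bag_of_words` and the simple tokenizer are hypothetical. Discarding word order makes verbatim reconstruction difficult, although it is not a formal guarantee, especially for very short texts.

```python
import re
from collections import Counter


def to_bag_of_words(text: str) -> dict[str, int]:
    """Reduce a text to token counts, discarding word order.

    Hypothetical helper for illustration: tokens are lowercased runs of
    letters, digits, and apostrophes. Sorting the keys alphabetically
    removes any trace of the original token sequence.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    return dict(sorted(counts.items()))


# Example document (invented for this sketch)
doc = "The cat sat on the mat. The mat was warm."
bag = to_bag_of_words(doc)
print(bag)
```

A representation like this still supports frequency-based analyses such as dictionary methods or topic models, which is why it can serve as a shareable stand-in for the protected text.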

Non-consumptive research was pioneered by Google Books, which allows users to search for specific keywords within books without displaying the entire text. It can involve simple keyword searches and frequency analyses, but also more sophisticated techniques.
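The simplest non-consumptive operations, keyword search and frequency analysis, can be sketched as follows. This is a hedged illustration only: the class `NonConsumptiveCorpus` and its methods are hypothetical names, not part of any existing tool, and a real service would keep the texts on a server rather than in the analyst's process. The point is that queries return aggregate numbers and never the text itself.

```python
import re
from collections import Counter


class NonConsumptiveCorpus:
    """Hypothetical wrapper that answers aggregate queries about a corpus
    without ever returning the underlying text."""

    def __init__(self, documents):
        # Only per-document token counts are retained; the raw strings
        # are not stored on the object.
        self._token_counts = [
            Counter(re.findall(r"[a-z0-9']+", doc.lower())) for doc in documents
        ]

    def document_frequency(self, keyword: str) -> int:
        """Number of documents that contain the keyword at least once."""
        kw = keyword.lower()
        return sum(1 for counts in self._token_counts if counts[kw] > 0)

    def term_frequency(self, keyword: str) -> int:
        """Total number of occurrences of the keyword across the corpus."""
        kw = keyword.lower()
        return sum(counts[kw] for counts in self._token_counts)


# Example corpus (invented for this sketch)
corpus = NonConsumptiveCorpus([
    "Data sharing is caring.",
    "Sharing data helps research.",
    "Privacy matters.",
])
print(corpus.document_frequency("sharing"))
print(corpus.term_frequency("data"))
```

More sophisticated techniques would follow the same pattern: the analysis runs where the data lives, and only derived results are released.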

The three avenues are not mutually exclusive and can be strategically combined to maximize text dataset sharing within legal and ethical constraints. To demonstrate the practical implementation of these strategies, we have developed software tools that operationalize them.

By anchoring our approach in the dual principles of reproducibility and ethical responsibility, this research bridges the divide between data accessibility and protection. Our work offers a forward-looking pathway for sharing text data that maintains the intellectual integrity of computational social science while respecting privacy and legal considerations.

Johannes B. Gruber
VU Amsterdam
Netherlands

Wouter van Atteveldt
VU Amsterdam
Netherlands