IASSIST 2025: IASSIST at 50! Bridging oceans, harbouring data & anchoring the future


Not dirty, not clean: the language of making changes to your data is a literacy issue

Specialization often leads to, if not necessitates, the coining and use of technical terms and jargon specific to a discipline. Many fields engage in data science and computational research, and yet there isn’t a universal, shared language for one of the most fundamental steps of working with data: some refer to it as cleaning, some are merely tidying, and others wrangle, scrub, process, munge, manipulate, transform, etc. These terms refer to everything from correcting encoding errors and typos to large-scale content moderation, from normalization of results to altering data for nefarious purposes. And because data can pass through many hands and be used for a range of studies in different fields, these changes can have significant near and far consequences for methodologies, analyses, reproducibility, and much more. Some of these terms, in particular “data cleaning” and the related term “tidying”, have already been identified as problematic by data feminists, information scientists, and others. And in at least one discipline, a case has been made to assign a specific (arguably more accurate) meaning to the term, though that meaning has not carried over to other fields. The lack of precision of these terms also obscures the reality of working with data in research; not only is “clean data” a misnomer, but its use negatively affects expectations and belies the importance of the labor it refers to and its potential repercussions. Drawing on work and case studies from law, business, and economics, in addition to the fields listed above, this presentation takes a multi-disciplinary look at what these terms refer to and the implications of the language used, and argues that a universal approach is a critical data literacy issue.

Carol Choi
New York University
United States