Data Sheets in Practice: An Exploration of Machine Learning Dataset Descriptions

One approach to addressing bias in Machine Learning algorithms has been a call for transparency and critical evaluation of the training data that feed into Machine Learning models. Data descriptions are seen as a vehicle for increasing transparency and creating awareness of potential shortcomings and ethical issues in predictive modeling. Perhaps the most prominent among them is the proposal of "Datasheets for Datasets" by Gebru et al. (2021). Data sheets provide information about the provenance, use, and limitations of a data set and ask questions about the social and ethical implications of data production and use. The use of these templates is voluntary, suggestive, and modular. So how are they actually applied? Based on an analysis of about 62,000 data sets from Hugging Face, as well as a close reading of selected case studies, this project attempts to gain insight into the practices of using Data Sheets for Machine Learning training data.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Claudia Engel
Stanford University
United States