By LHorton | April 3, 2017
For the third time, IASSIST sponsored the International Digital Curation Conference, this year supporting three students, one each from Switzerland, Korea, and Canada, to attend the conference, titled "Upstream, Downstream: embedding digital curation workflows for data science, scholarship and society".
Data science was a strong theme of the three keynote presentations, in particular how curation and data management are active, integrated, ongoing parts of analysis rather than a passive epilogue to research.
Maria Wolters talked about how missing data can provide research insights by analysing patterns of absence and, counter-intuitively, can improve the quality of datasets through the concept of managed forgetting: by asking whether data is important to preserve and whether it is relevant at the moment, we can better manage and find it. Alice Daish showed her work as a data scientist at the British Museum, with the goal of enabling data-informed decision-making. This involved identifying data "silos" and "wrangling" data into exportable formats, along with zealous use and promotion of R, but also thinking about the way data is communicated to management. Chris Williams demonstrated how the Alan Turing Institute handles data mining. He reported that about 80 percent of work on data mining involves understanding and preparing data. This ranges from understanding formats and running descriptives to look for outliers and anomalies, to cleaning untidy and inconsistent metadata and coding. The aim is to automate as much of this as possible with the Automatic Statistician project.
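As a rough illustration of the kind of preparatory work Williams described, the minimal sketch below (with hypothetical data and an assumed two-standard-deviation threshold, not anything presented at the conference) profiles a numeric column for outliers and normalises inconsistently coded categories:

```python
from statistics import mean, stdev

# Hypothetical survey ages; 290 is a likely data-entry error
ages = [34, 29, 41, 38, 290, 33]
mu, sigma = mean(ages), stdev(ages)
# Flag values more than two standard deviations from the mean
outliers = [x for x in ages if abs(x - mu) > 2 * sigma]

# The same category coded inconsistently, cleaned with a mapping
raw_codes = ["M", "male", "MALE", "f", "Female"]
canonical = {"m": "male", "male": "male", "f": "female", "female": "female"}
clean = [canonical[c.lower()] for c in raw_codes]
```

In practice such checks are usually run with dedicated tooling (the talk highlighted R), and the Automatic Statistician project aims to automate exactly this sort of routine inspection.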
In a session on data policies, University of Toronto's Dylanne Dearborn and Leanne Trimble showed how libraries can use creative thinking to match publication patterns against journal data policies in providing support. Fieke Schoots outlined the approach at Leiden, which includes a requirement for PhD candidates to state the location of their research data before their defence can take place, and twenty-year retention for Data Management Plans. Switching to journals, Ian Hrynaszkiewicz talked about the work Springer Nature has done to standardise journal data policies into one of four types, allied with support for authors and editors on policy identification and implementation.
Ruth Geraghty dealt with ethical challenges in retro-fitting a data set for sharing. She introduced the Children's Research Network for Ireland and Northern Ireland. This involved attempting to obtain consent from participants for sharing, but also work on anonymising the data to enable sharing. Although a problematic and resource-intensive endeavour, the result is not only a reusable data set but also informed guidance for other projects on archiving and sharing. Niamh Moore has long experience of archiving her research and focused on another legacy archive, the Clayoquot Lives oral history project. Niamh is using Omeka as a sharing platform because it gives the researcher control over how the data can be presented for reuse. For example, Omeka has the capacity for creating exhibits to showcase themes.
Community is important in both curation and management. Marta Teperek and Rosie Higman introduced work at Cambridge on collaborative communities and data champions. Finding that a top-down compliance approach was not working, Cambridge moved to a bottom-up engagement style, bringing researchers into decision-making on policies and support. Data champions are a new approach to seeding advocates and trainers around the university as local contact points, based on a community of practice model. The rewards of this approach are potentially rich, but the costs of setting up and managing it are high, and the behaviour of the community is not always controllable. Two presentations on community/citizen science from Andrea Copeland and Peter Darch also hit on the theme of controlling groups in curating data. The Galaxy Zoo project found there were lessons to learn about the behaviour of volunteers, particularly the negative impact of a "league table" credit system on retaining contributors, and how volunteers expected only to contribute classifications were in some cases doing data science work in noticing unusual objects.
A topic of relevance to social science-focused curation is sensitive data. Debra Hiom introduced the University of Bristol's method of providing safe access to sensitive data. Once again, it is resource-intensive, requiring a committee classification of data into levels of access and process reviews to ensure applications are genuine. However, the result is that data that cannot be open can be shared responsibly. Sebastian Karcher from the Qualitative Data Repository spoke about managing sensitive data in the cloud, a task further complicated by the lack of a federal data protection law in the United States. Elizabeth Hull (Dryad) presented on developing an ethical framework for curating social media data. A common perception is that social media posts are fair use if made public. However, from an ethical perspective, posters may not understand their "data" is being collected for research purposes, and users need to know that use of @ or # on Twitter means they are inviting involvement and sharing in wider discussions. Hull offered a "STEP" approach as a way to deal with social media data, balancing the benefit of preservation and sharing against the risk of harm and reasonable consent from research subjects.