I am he as you are he as you are me and we are all together

I'm just in the process of updating who we follow from our @iassistdata Twitter account (we follow members who follow us - or at least we do when I get round to updating things; sorry).

Given the huge* number of followers we now have (595, thank you one and all), I thought it would be interesting to see what we look like according to our Twitter bios.

No surprises: we define ourselves as data people or organisations, in terms of "research", "librarian" (and library-related terms), "social", "science", "digital", "information", and "universities". It suggests the people following us are the type of people who should be following us, given the organisation's goals, and hopefully they are getting some value from following @iassistdata.

*Obviously a subjective assessment when Justin Bieber has 44,625,042.
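
For anyone curious about reproducing this kind of snapshot, below is a minimal sketch of how the most frequent words in follower bios could be counted. It assumes the bios have already been exported to a plain-text file, one bio per line; the post does not say which tool actually produced the picture, so the file name and stopword list here are illustrative only.

    # Rough sketch: tally the most common words in exported Twitter follower bios.
    # Assumes "follower_bios.txt" contains one bio per line (hypothetical file name).
    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "for", "with", "our", "you", "are", "that", "this"}

    counts = Counter()
    with open("follower_bios.txt", encoding="utf-8") as f:
        for bio in f:
            words = re.findall(r"[a-z']+", bio.lower())
            counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)

    # Print the twenty most frequent bio words (e.g. "research", "data", "librarian").
    for word, n in counts.most_common(20):
        print(f"{word:15s} {n}")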

[Image: @iassistdata Twitter follower bios]

Finding Historical Economic Data through FRASER and ALFRED

The North Carolina Library Association's Government Resources Section had an excellent webinar yesterday on finding historical (or vintage) economic data using FRASER and ALFRED.  The recording and slides are available to everyone. Enjoy!

Sharing data: good for science, good for you

DANS has published a video to promote storing and sharing data within the research community. The video is available in Dutch and English and is shown on the DANS YouTube channel. The title of the English video is 'Sharing data: good for science, good for you': http://youtu.be/HJbo-OAaJ1I

"Scientific research produces data. The lifetime of these data varies greatly. Stored on a hard disk or USB stick they are likely to be lost in the near future together with the storage medium. Luckily, there is another, more sustainable option, which benefits science.

In this video Dutch historian Martijn Kleppe (Erasmus University Rotterdam) explains why he opened up his big photo database for other researchers to use, and quantitative data analyst Manfred te Grotenhuis (Radboud University Nijmegen) speaks about the treasures in data archives that are waiting to be discovered by researchers.

Both scientists made use of the online archiving system EASY from DANS (Data Archiving and Networked Services) in the Netherlands. As an institute of KNAW and NWO, DANS promotes sustained access to digital research data."

Feedback is welcome.

Marion Wittenberg

Congratulations to Dan Tsang and Wendy Watkins!

As some of you may know, Dan Tsang and Wendy Watkins have been named the 2013 winners of ICPSR's Flanigan Award for distinguished service as an ICPSR Official Representative (OR): http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/announcements/2013/07/icpsr-announces-2013-warren-e-miller

UC-Irvine recognizes Dan here: http://www.lib.uci.edu//features/spotlights/dt-award.html
Perhaps a Canadian colleague has a similar link for Wendy.

Congratulations to both Dan and Wendy!

The Role of Data Repositories in Reproducible Research

Cross posted from ISPS Lux et Data Blog

Questions about who is responsible for the quality of research data, and what that responsibility means in practice, were on my mind as I was preparing to present a poster at the Open Repositories 2013 conference in Charlottetown, PEI earlier this month. The annual conference brings the digital repositories community together with stakeholders, such as researchers, librarians, publishers and others, to address issues pertaining to “the entire lifecycle of information.” The conference theme this year, “Use, Reuse, Reproduce,” could not have been more relevant to the ISPS Data Archive. Two plenary sessions bookended the conference, both discussing the credibility crisis in science. In the opening session, Victoria Stodden set the stage with her talk about the central role of algorithms and code in the reproducibility and credibility of science. In the closing session, Jean-Claude Guédon made a compelling case that open repositories are vital to restoring quality in science.

My poster, titled “The Repository as Data (Re) User: Hand Curating for Replication,” illustrated the various data quality checks we undertake at the ISPS Data Archive. The ISPS Data Archive is a small archive, for a small and specialized community of researchers, containing mostly small data. We made a key decision early on to make it a "replication archive," by which we mean a repository that holds data and code for the purpose of being used to replicate and verify published results.

The poster presents the ISPS Data Archive’s answer to the questions of who is responsible for the quality of data and what that responsibility means: We think that repositories do have a responsibility to examine the data and code they receive for deposit before making the files public, and that this data review involves verifying and replicating the original research outputs. In practice, this means running the code against the data to validate the published results. These steps in effect expand the role of the repository and integrate it more closely into the research process, with implications for resources, expertise, and relationships, which I will explain here.
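
To make the idea concrete, here is a schematic sketch of what such a check can look like for a single deposit. Everything in it (the script name, the data file, the parameter name, and the published value) is hypothetical, and the Archive's actual workflow and tooling are not described here; the point is simply that the deposited code is re-run against the deposited data and the output is compared with the figures reported in the article.

    # Schematic replication check (all file names, parameters, and values are
    # hypothetical). Re-run the deposited analysis code against the deposited
    # data, then compare a key estimate with the value reported in the article.
    import csv
    import subprocess

    # Step 1: run the author's analysis exactly as deposited.
    subprocess.run(
        ["python", "analysis.py", "--data", "study_data.csv", "--out", "results.csv"],
        check=True,
    )

    # Step 2: compare the regenerated estimate with the published one.
    PUBLISHED_ESTIMATE = 0.042   # value reported in the article (hypothetical)
    TOLERANCE = 0.001            # allow for rounding in the published table

    with open("results.csv", newline="") as f:
        estimates = {row["parameter"]: float(row["estimate"]) for row in csv.DictReader(f)}

    reproduced = estimates["treatment_effect"]
    if abs(reproduced - PUBLISHED_ESTIMATE) <= TOLERANCE:
        print("OK: deposited code reproduces the published estimate", reproduced)
    else:
        print("MISMATCH: reproduced", reproduced, "but article reports", PUBLISHED_ESTIMATE)
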
First, a word about what data repositories usually do, the special obligations reproducibility imposes, and who is fulfilling them now. This ties in with a discussion of data quality, data review, and the role of repositories.

Data Curation and Data Quality

A well-curated data repository is more than a place to put data. The Digital Curation Centre (DCC) explains that data curation means ensuring data are accessible to designated users for first-time use and reuse. This involves a set of curatorial practices – maintaining, preserving and adding value to digital research data throughout their lifecycle – which reduces threats to the long-term research value of the data, minimizes the risk of obsolescence, and enables sharing and further research. An example of a standard-setting curation process is that of the Inter-university Consortium for Political and Social Research (ICPSR). This process involves organizing, describing, cleaning, enhancing, and preserving data for public use and includes format conversions, reviewing the data for confidentiality issues, creating documentation and metadata records, and assigning digital object identifiers. Similar data curation activities take place at many data repositories and archives.

These activities are understood as essential for ensuring and enhancing data quality. Dryad, for example, states that its curatorial team “works to enforce quality control on existing content.” But there are many ways to assess the quality of data. One criterion is verity: whether the data reflect actual facts, responses, observations or events. This is often assessed by the existence and completeness of metadata. The UK’s Economic and Social Research Council (ESRC), for example, requests documentation of “the calibration of instruments, the collection of duplicate samples, data entry methods, data entry validation techniques, methods of transcription.” Another way to assess data quality is by its degree of openness. Shannon Bohle recently listed no fewer than eight different standards for assessing the quality of open data on this dimension. Others argue that data quality consists of a mix of technical and content criteria that all need to be taken into account. Wang & Strong’s 1996 article claims that “high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer.” More recently, Kevin Ashley observed that quality standards may be at odds with each other: for example, some users may prize the completeness of the data while others prize their timeliness. These standards can go a long way toward ensuring that data are accurate, complete, and timely and that they are delivered in a way that maximizes their use and reuse.

Yet these procedures are “rather formal and do not guarantee the validity of the content of the dataset” (Doorn et al.). Leaving aside the question of whether they are always adhered to, these quality standards are insufficient when viewed through the lens of “really reproducible research.” Reproducible science requires that data and code be made available alongside the results, to allow regeneration of the published results. For a replication archive such as the ISPS Data Archive, the reproducibility standard is imperative.

Data Review

The imperative to provide data and code, however, only creates the potential for verification of published results; it remains unclear how actual replication is to occur. That’s where a comprehensive definition of the concept of “data review” can be useful: at ISPS, we understand data review to mean taking that extra step – examining the data and code received for deposit and verifying and replicating the original research outputs.

In a recent talk, Christine Borgman pointed out that most repositories and archives follow the letter, not the spirit, of the law. They take steps to share data, but they do not review the data. “Who certifies the data? Gives it some sort of imprimatur?” she asks. This theme resonated at Open Repositories. Stodden asked: “Who, if anyone, checks replication pre-publication?” Chuck Humphrey lamented the lack of an adequate data curation toolkit and best practices regarding the extent of data processing prior to ingest. And Guédon argued that repositories have a key role to play in bringing quality to the foreground in the management of science.

Stodden’s call for the provision of the data and code underlying a publication echoes Gary King’s 1995 definition of the “replication standard” as the provision of “sufficient information… with which to understand, evaluate, and build upon a prior work if a third party could replicate the results without any additional information from the author.” Both call on the scientific community to take up replication as a matter of course, for the good of science. However, both are vague as to how this can be accomplished. Stodden suggested at Open Repositories that this activity is community-dependent, often done by students or by other researchers continuing a project, and that community norms can be adjusted by rewarding high-integrity, verifiable research. King, on the other hand, argues that “the replication standard does not actually require anyone to replicate the results of an article or book. It only requires sufficient information to be provided – in the article or book or in some other publicly accessible form – so that the results could in principle be replicated” (emphasis added). Yet, if we care about data quality, reproducibility, and credibility, it seems to me that this is exactly the kind of review in which we should be engaging.

A quick survey of various stakeholders in the research data lifecycle reveals that data review of this sort is not widely practiced:

  • Researchers, on the whole, do not do replication tests as part of their own work, or even as part of the peer review process. In the future, there may be incentives for researchers to do so, and post-publication crowd-sourced peer review in the mold of Wikipedia, as promoted by Edward Curry, may prove to be a successful model.
  • Academic institutions, and their libraries, are increasingly involved in the data management process, but are not involved in replication as a matter of course (note some calls for libraries to take a more active role in this regard).
  • Large or general data repositories like Dryad, FigShare, Dataverse, and ICPSR provide useful guidelines and support varying degrees of file inspection, as well as making it significantly easier to include materials alongside the data, but they do not replicate analyses for the purpose of validating published results. Efforts to encourage compliance with (some of) these standards (e.g., the Data Seal of Approval) typically regard researchers as responsible for data quality, and generally leave repositories to self-regulate.
  • Innovative services, such as RunMyCode, offer a dissemination platform for the necessary pieces required to submit the research to scrutiny by fellow scientists, allowing researchers, editors, and referees to “replicate scientific results and to demonstrate their robustness.” RunMyCode is an excellent facilitator for people who wish to have their data and code validated; but it relies on crowd sourcing, and does not provide the service per se.
  • Some argue that scholarly journals should take an active role in data review, but this view is controversial. A document produced by the British Library recently recommended that “publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers.” In some disciplines, reviewers do check the data. The F1000 group identifies the “complexity of the relationship between the data/article peer review conducted by our journal and the varying levels of data curation conducted by different data repositories.” The group provides detailed guidelines for authors on what is expected of them to submit and ensures that everything is submitted and all checklists are completed. It is not clear, however, whether the group itself reviews the data to make sure the results replicate. Alan Dafoe, a political scientist at Yale, calls for better replication practices in political science. He places responsibility on authors to provide quality replication files, but also suggests that journals encourage high standards for replication files and conduct a “replication audit” which would “evaluate the replicability and robustness of a random subset of publications from the journal.”

The ISPS Data Archive and Reproducible Research

This brings us to the ISPS Data Archive. As a small, on-the-ground, specialized data repository, we are dedicated to serious data review. All data and code – as well as all accompanying files – that are made public via the Archive are closely reviewed and adhere to standards of quality that include verity, openness, and replication. In practice, this means that we have developed curatorial practices that include assessing whether the files underlying a published (or soon-to-be-published) article, as provided by the researchers, actually reproduce the published results.

This requires significant investment in staffing, relationships, and resources. The ISPS Data Archive staff has data management and archival skills, as well as domain and statistical expertise. We invest in relationships with researchers and learn about their research interests and methods to facilitate communication and trust. All this requires the right combination of domain, technical and interpersonal skills as well as more time, which translates into higher costs.

How do we justify this investment? Broadly speaking, we believe that stewardship of data in the context of “really reproducible research” dictates this type of data review. More specifically, we think this approach provides better quality, better science, and better service.

  • Better quality. By reviewing all data and code files and validating the published results, the ISPS Data Archive essentially certifies that all its research outputs are held to a high standard. Users are assured that code and data underlying publications are valid, accessible, and usable.
  • Better science. Organizing data around publications advances science because it helps root out error. “Without access to the data and computer code that underlie scientific discoveries, published findings are all but impossible to verify” (Stodden et al.). Joining the publication to the data and code combats the disaggregation of information in science associated with open access to data and to publications on the Web. In effect, the data review process is a first-order data reuse case: the use of research data for a research activity or purpose other than that for which it was intended. This positions the Archive as an active partner in the scientific process, as it performs a sort of “internal validity” check on the data and analysis (i.e., do these data and this code actually produce these results?).

    It’s important to note that the ISPS Data Archive is not reviewing or assessing the quality of the research itself. It is not engaged in questions such as, was this the right analysis for this research question? Are there better data? Did the researchers correctly interpret the results? We consider this aspect of data review to be an “external validity” check and one which the Archive staff is not in a position to assess. This we leave to the scientific community and to peer review. Our focus is on verifying the results by replicating the analysis and on making the data and code usable and useful.

  • Better service. The ISPS Data Archive provides high level, boutique service to our researchers. We can think of a continuum of data curation that progresses from a basic level where data are accepted “as is” for the purpose of storage and discovery, to a higher level of curation which includes processing for preservation, improved usability, and compliance, to an even higher level of curation which also undertakes the verification of published results.

This model may not be applicable to other contexts. A larger lab, greater volume of research, or simply more data will require greater resources and may prove this level of curation untenable. Further, the reproducibility imperative does not neatly apply to more generalized data, or to data that is not tied to publications. Such data would be handled somewhat differently, possibly with less labor-intensive processes. ISPS will need to consider accommodating such scenarios and the trade-offs a more flexible approach no doubt involves.

For those of us who care about research data sharing and preservation, the recent interest in the idea of a “data review” is a very good sign. We are a long way from having all the policies, technologies, and long-term models figured out. But a conversation about reviewing the data we put in repositories is a sign of maturity in the scholarly community – a recognition that simply sharing data is necessary, but not sufficient, when held up to the standards of reproducible research.

OR2013: Open Repositories Confront Research Data

Open Repositories 2013 was hosted by the University of Prince Edward Island from July 8-12. A strong research data stream ran throughout this conference, which was attended by over 300 participants from around the globe.  To my delight, many IASSISTers were in attendance, including the current IASSIST President and four Past-Presidents!  Rarely do such sightings happen outside an IASSIST conference.

This was my first Open Repositories conference and after the cool reception that research data received at the SPARC IR meetings in Baltimore a few years ago, I was unsure how data would be treated at this conference.  I was pleasantly surprised by the enthusiastic interest of this community toward research data.  It helped that there were many IASSISTers present but the interest in research data was beyond that of just our community.  This conference truly found an appropriate intersection between the communities of social science data and open repositories. 

Thanks go to Robin Rice (IASSIST), Angus Whyte (DCC), and Kathleen Shearer (COAR) for organizing a workshop entitled, “Institutional Repositories Dealing with Data: What a difference a ‘D’ makes!”  Michael Witt, Courtney Matthews, and I joined these three organizers to address a range of issues that research data pose for those operating repositories.  The registration for this workshop was capped at 40 because of our desire to host six discussion tables of approximately seven participants each.  The workshop was fully subscribed and Kathleen counted over 50 participants prior to the coffee break.  The number clearly expresses the wider interest in research data at OR2013.

Our workshop helped set the stage for other sessions during the week.  For example, we talked about environmental drivers popularizing interest in research data, including topics around academic integrity.  Regarding this specific issue, we noted that the focus is typically directed toward specific publication-related datasets and the access needed to support the reproducibility of published research findings.  Both the opening and closing plenary speakers addressed aspects of academic integrity and the role of repositories in supporting the reproducibility of research findings.  Victoria Stodden, the opening plenary speaker, presented a compelling and articulate case for access to both the data and computer code upon which published findings are based.  She calls herself a computational scientist and defends the need to preserve computer code as well as data to facilitate the reproducibility of scientific findings.  Jean-Claude Guédon, the closing plenary speaker, bracketed this discussion on academic integrity.  He spoke about scholarly publishing and how the commercial drive toward indicators of excellence has resulted in cheating.  He likened some academics to Lance Armstrong, cheating to become number one.  He feels that quality rather than excellence is a better indicator of scientific success.

Between these two stimulating plenary speakers, there were a number of sessions during which research data were discussed. I was particularly interested in a panel of six entitled “Research Data and Repositories,” especially because the speakers were from the repository community instead of the data community. They each took turns responding to questions about what their repositories do now regarding research data and what they see happening in the future. In a nutshell, their answers tended to describe the desire to make better connections between the publications in their repositories and the data underpinning the findings in those articles. They also spoke about the need to support more stages of the research lifecycle, which often involves aspects of the data lifecycle within research. There were also statements that reinforced the need for our (IASSIST’s) continued interaction with the repository community. Practices such as the use of readme files in the absence of standards-based metadata, an area where our data community has moved the best-practice yardstick well beyond, demonstrate the need for our communities to continue in dialogue.

Chuck Humphrey

Ich bin ein IASSISTer

From 28 to 31 May, GESIS - Leibniz Institute for the Social Sciences hosted the 39th Annual Conference of the International Association for Social Science Information Service and Technology, aka #iassist2013.

IASSIST conferences provide an overview of what’s happening in information technology and data services and allow an exchange of ideas between participants from different backgrounds - from social science and humanities to information and computer science. The aim of this year's event was to help us move closer to the dream of technical and organizational measures that make research data discoverable and accessible.

Two hundred and eighty-five participants were welcomed to Cologne by GESIS President York Sure-Vetter ahead of a program of workshops, presentations, posters and discussions around this year’s topic of "Data Innovation: Increasing Accessibility, Visibility, and Sustainability".

The first day of the conference offered eight workshops, providing participants the opportunity to look at specific topics like data licensing, data visualization or DOI assignment. Sessions on a variety of tools and methods were also offered, specifically the OLAP analysis method, the open source software R, and CharmStats - GESIS’s newly developed data harmonization software, which was formally launched at IASSIST.

Over the following three days there were a total of three plenaries and 32 concurrent sessions organized in three tracks.

Presentations and discussions were concentrated in the "Research Data Management" (RDM) track, which embraced a spectrum of topics related to all aspects of the data lifecycle. Emphasis was on policies, strategies and tools to support researchers in managing their research data. In addition, presentations demonstrated various supporting collaborative infrastructures and virtual research environments at the institutional, national and international levels. Another focus was data citation and data publication as ways to enhance the discoverability of data and to provide professional credit for data sharing. Further discussion offered answers to the question of how responsible use of complex or sensitive data can be facilitated. Finally, sessions in the RDM track were dedicated to data curation and long-term preservation.

The track "Data Developers and Tools" presented a technical point of view, with offerings from those working in application development – seasoning their work with a good dash of metadata. Questions were asked and solutions presented on the topics of interoperability, interconnection and integration, and preservation of data. A special role here is played by the DDI metadata standard, for which many tools and applications have been introduced to simplify the creation and management of DDI metadata or to provide value-added services built on the standard.

The track "Data Public Services/Librarianship" confronted aspects of access to research data. Here, the development of data services from a country-specific perspective (Bosnia and Herzegovina, Serbia, Croatia) was highlighted, but the track also managed to look at specific data types (non-digital, historical, confidential and sensitive data).

Slides of the presentations and video recordings of selected events will be published in the coming weeks on the IASSIST website, providing you an opportunity to plunge into the world of IASSIST 2013. Let’s do it all again in Toronto for IASSIST 2014!

Astrid Recker, Laurence Horton, Alexia Katsanidou
GESIS Archive and Data Management Training Center 

IASSIST 2013 by the numbers

  • 285 participants from 29 countries (a new IASSIST record!)
  • Nearly two-thirds from Europe (64%), one-third from North America
  • 2 participants from Africa and 9 from the Asia-Pacific region

Top 5 countries represented

  • Germany: 88
  • United States: 66
  • UK: 32
  • Canada: 23
  • Netherlands: 10

Activity

  • 8 workshops with 103 participants
  • 32 parallel sessions featuring 126 presentations
  • 35 posters
  • 11 Pecha Kuchas
  • 3 plenary sessions
  • 2 songs
  • 1 banquet
  • Lots of white asparagus served
  • Many glasses of Kölsch drunk
  • ∞ Complaints about the venue Wi-Fi

IASSIST needs YOU to write a blog post about some aspect of the conference

I could say this is needed because the videos and wifi weren't working, but in fact I'd be asking anyway.

Just think how helpful it is for members who could not be with us (and potential members) to get a snapshot of views about what happened. Serious, silly, short, verbose, objective, or ridiculously opinionated impressions of a single session, the social events, or the conference overall are all very welcome.

Turn your personal notes into a gift that keeps on giving.

If you have any trouble posting (as a member) simply contact iassistwebmaster@gmail.com and we'll sort you out.

Robin Rice

IASSIST Communications Chair & Website Editor

IASSIST 2013 conference song

Many thanks to Kate, Dan, and especially Melanie for this year's bit of silliness. Thanks once again to Lynda Kellam for the video.

(Sung to the tune of "Lili Marlene". If anyone wonders, the connection of the melody with Cologne is that the composer, Norbert Schultze, once studied in Cologne.)

It's 2013 IASSIST, and welcome to Cologne
GESIS has endeavored to make us feel at home
Data innovation was the theme
And wifi problems made the scene
We struggled to communicate
Some tweets might turn up late

Monday was for meetings, workshops the next day
Curation, access, OLAP, and R all made their play
At the reception we all met
The Kölsch we drank helped us to set
The tone for what next came
More walking in the rain

Wednesday was the day when it all really began
Attendance at the talks at Maternushaus to plan
The talks on RDM were cool
As well as Services and Tools
Kristin's duckie charmed the crew,
Sam's rabbit joined in too

Wifi issues carried on, tweets were very few
Posters and pecha kuchas were quite enough to do
Data parachutes were opened, then
We walked to Rheinterrassen
With Spargel on the menu -
'Asparagus' to you

Friday was the day when winding down began
Next year's LAC began to talk about the plan
This year's IASSIST was really great
For next year's 40th you'll wait
Toronto welcomes you
With wifi access too.

New Latin American Open Data site!

Miguel Paz writes:

Poderomedia Foundation and PinLatam are launching OpenDataLatinoamerica.org, a regional data repository to free data and use it in hackathons and other activities organized by Hacks/Hackers chapters and other organizations.

We are doing this because the road to the future of news has been littered with lost datasets. A day or so after every hackathon and meeting where a group has come together to analyze, compare and understand a particular set of data, someone tries to remember where the files were stored. Too often, no one is certain. That is why Mariano Blejman and I realized that we need a central repository where you can share data that has been proven reliable: OpenData Latinoamerica, which we are leading as ICFJ Knight International Journalism Fellows.

If you work in Latin America or Central America, your organization can take part in OpenDataLatinoamerica.org. To apply, go to the website and fill in a simple form agreeing to meet the standard criteria for open data. Once the application is approved, you will receive an account so you can start uploading and managing open data, becoming part of the community.
