
IQ 40:1 Now Available!

Our World and all the Local Worlds
Welcome to the first issue of Volume 40 of the IASSIST
Quarterly (IQ 40:1, 2016). We present four papers in this issue.
The first paper presents data from our very own world,
extracted from papers published in the IQ across four
decades. What is published in the IQ is often limited in
geographical scope, and in this issue the other three papers
present investigations and project research carried out at
New York University, Purdue University, and the Federal
Reserve System. However, the subject scope of the papers
and the methods employed bring great diversity. And
although the papers are local in origin, they all aim at
generalization in order to spread the information and
experience.


We proudly present the paper that received the 'best
paper award' at the IASSIST conference 2015. Great thanks
are expressed to all the reviewers who took part in the
evaluation! In the paper 'Social Science Data Archives: A
Historical Social Network Analysis', the authors Kristin R.
Eschenfelder (University of Wisconsin-Madison), Morgaine
Gilchrist Scott, Kalpana Shankar, and Greg Downey
report on inter-organizational influence and
collaboration among social science data archives, drawing
on data from articles published in the IASSIST Quarterly
from 1976 to 2014. The paper demonstrates social network analysis
(SNA) using a web of 'nodes' (people/authors/institutions)
and 'links' (relationships between nodes). Several types
of relationships are identified: influencing, collaborating,
funding, and international. The dynamics are shown in
detail by employing five-year sections. I noticed that from
a reluctant start the number of relationships has grown
significantly and that archives have continuously become better
at bringing in 'influence' from other 'nodes'. The paper
contributes to the history of social science data archives and
the shaping of a research discipline.
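
For readers who want to experiment with the technique themselves, here is a minimal sketch of this kind of network analysis, assuming Python with the networkx library; the nodes and relationship types below are invented for illustration and are not the authors' data.

```python
import networkx as nx

# Toy influence/collaboration network in the spirit of the paper's SNA:
# nodes are archives, authors, or institutions; edges carry a relationship type.
G = nx.Graph()
G.add_edge("Archive A", "Archive B", relation="collaborating")
G.add_edge("Archive A", "Archive C", relation="influencing")
G.add_edge("Archive B", "Funder X", relation="funding")

# Degree centrality hints at which nodes attract the most relationships.
print(nx.degree_centrality(G))

# Restrict the web to a single relationship type, e.g. collaboration ties.
collab = G.edge_subgraph(
    (u, v) for u, v, d in G.edges(data=True) if d["relation"] == "collaborating"
)
print(list(collab.edges()))
```

Repeating such an analysis on edges binned into five-year windows would show the growth dynamics the paper describes.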


The paper 'Understanding Academic Patrons’ Data Needs
through Virtual Reference Transcripts: Preliminary Findings
from New York University Libraries' is authored by Margaret
Smith and Jill Conte, who are both librarians at New York
University, and Samantha Guss, a librarian at the University
of Richmond who worked at New York University from
2009 to 2014. The goal of their paper is 'to contribute to the
growing body of knowledge about how information
needs are conceptualized and articulated, and how this
knowledge can be used to improve data reference in an
academic library setting'. This is carried out by analysis of
chat transcripts of requests for census data at NYU. Demand
for the virtual services of the NYU Libraries is high, with
as many as 15,000 chat transactions annually.
There has not been much qualitative research into users'
data needs, but here the authors exemplify the iterative
nature of grounded theory, with data collection and analysis
processes inextricably entwined, using a range of
software tools such as FileLocator Pro, TextCrawler, and Dedoose.
Three years of chat reference transcripts were filtered down
to 147 transcripts related to United States and international
census data. This unique data source provides several insights,
presented in the paper. However, the authors are also aware of
the limitations of the method, as the transcripts did not capture
whether the patron or librarian considered the interaction successful.
The conclusion is that there is a need for additional librarian
training and improved research guides.
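
As a rough illustration of the first culling step (the authors used FileLocator Pro and TextCrawler for searching and Dedoose for coding), here is a hypothetical Python sketch that keeps only transcripts mentioning census-related terms; the directory name and term list are invented.

```python
import re
from pathlib import Path

# Hypothetical keyword filter for chat transcripts stored as text files.
CENSUS_TERMS = re.compile(
    r"\b(census|ACS|American Community Survey|IPUMS)\b", re.IGNORECASE
)

def census_related(transcript_dir: str) -> list[Path]:
    """Return the transcript files that mention at least one census-related term."""
    return [
        path
        for path in Path(transcript_dir).glob("*.txt")
        if CENSUS_TERMS.search(path.read_text(encoding="utf-8", errors="ignore"))
    ]

if __name__ == "__main__":
    matches = census_related("transcripts/")  # invented directory name
    print(f"{len(matches)} candidate transcripts for close coding")
```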


The third paper is also from a university. Amy Barton, Paul
J. Bracke, and Ann Marie Clark, all from Purdue University,
collaborated on the paper 'Digitization, Data Curation,
and Human Rights Documents: Case Study of a Library
Researcher-Practitioner Collaboration'. The project
concerns the digitization of the Urgent Action Bulletins of
Amnesty International from 1974 to 2007. The political
science research centered on changes in transnational
human rights advocacy and legal instrumentation, while
the Libraries’ research related to data management,
metadata, the data lifecycle, etcetera. The specific research
collaboration model developed was also generalized for
future practitioner-librarian collaboration projects. The
project is part of a recent tendency for academic
libraries to improve engagement and combine activities
between libraries, users, and institutions. The project
attempts to integrate two different lifecycle models, thus
serving both research and curatorial goals. The
central question is: 'can digitization processes be designed
in a manner that feeds directly into analytical workflows
of social science researchers, while still meeting the
needs of the archive or library concerned with long-term
stewardship of the digitized content?'. The project builds
on the Urgent Action Bulletins produced by Amnesty
International as an indication of how human rights concerns
changed over time, and of the threats in different countries
at different periods, and it combines library standards
for digitization and digital collections with researcher-driven
metadata and coding strategies. Data creation
started with scanning and optical character recognition
(OCR) to produce full-text PDFs for text recognition
and modeling in the NVivo software. The project
did succeed in developing shared standards. However, a
fundamental challenge arose from the grant-driven
timelines of both library and researcher. It seems to me that
the expectation of parallel work was the challenge to the
project. Things take time.
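
For a feel of what the first step of such a digitization pipeline involves, here is a minimal sketch of batch OCR over scanned PDFs. It assumes Python with the pdf2image and pytesseract libraries (which in turn need the poppler and tesseract system tools); it is an illustration, not the Purdue project's actual toolchain.

```python
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path: Path) -> str:
    """Render each page of a scanned PDF to an image and OCR it to plain text."""
    pages = convert_from_path(str(pdf_path), dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

if __name__ == "__main__":
    # Invented directory of scanned bulletins; writes a .txt next to each PDF.
    for pdf in Path("bulletins").glob("*.pdf"):
        pdf.with_suffix(".txt").write_text(ocr_pdf(pdf), encoding="utf-8")
```

The resulting text files could then be imported into a tool such as NVivo for coding and modeling.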


In the fourth paper we enter the case of the Federal Reserve
System. San Cannon and Deng Pan, working at the Federal
Reserve Banks of Kansas City and Chicago, created a pilot
infrastructure and workflow to support making the
publication of research data a regular part of the research
lifecycle. This is reported in the paper 'First Forays into
Research Data Dissemination: A Tale from the Kansas City
Fed'. More than 750 researchers across the system produce
about 1,000 journal articles, working papers, etcetera, every year.
The need for data to support the research has been
recognized, and the institution is setting up a repository
and defining a workflow to support data preservation
and future dissemination. In early 2015 the internal Center
for the Advancement of Research and Data in Economics
(CADRE) was established with a mission to support, enhance,
and advance data-intensive or computationally intensive research;
preservation and dissemination were identified as
important support functions for CADRE. The paper presents
details of and questions about the design, such as the types of
collections and the kinds and sizes of data files, and it demonstrates
the influence of testers and curators. The pilot also had to
decide on the metadata fields to be used when data is
submitted to the system. The complete setup including
incorporated fields was enhanced through pilot testing and
user feedback. The pilot is now being expanded to other
Federal Reserve Banks.


Papers for the IASSIST Quarterly are always very welcome.
We welcome input from IASSIST conferences or other
conferences and workshops, from local presentations or
papers especially written for the IQ. When you are preparing
a presentation, give a thought to turning your one-time
presentation into a lasting contribution. We permit authors
'deep links' into the IQ as well as deposit of the paper in
their local repository. Chairing a conference session with
the purpose of aggregating and integrating papers for a
special issue of the IQ is also much appreciated, as the
information reaches many more people than the session
participants and will be readily available on the IASSIST
website at http://www.iassistdata.org.


Authors are very welcome to take a look at the instructions
and layout: http://iassistdata.org/iq/instructions-authors.

Authors can also contact me via e-mail: kbr@sam.sdu.dk.
Should you be interested in compiling a special issue for
the IQ as guest editor(s), I will also be delighted to hear
from you.


Karsten Boye Rasmussen
June 2016
Editor

Looking Back/Moving Forward - Reflections on the First Ten Years of Open Repositories

The Open Repositories conference celebrated its first decade with four full days of exciting workshops, keynotes, sessions, 24x7 talks, and development track and repository interest group sessions in Indianapolis, USA. All the fun took place in the second week of June. The OR2015 conference was themed "Looking Back/Moving Forward: Open Repositories at the Crossroads" and it brought over 400 repository developers and managers, librarians and library IT professionals, service providers and other experts to hot and humid Indy.

As with IDCC earlier this year, IASSIST was officially a supporter of OR2015. In my opinion, it was a worthy investment given the topics covered, the depth and quality of the presentations, and the attendee profile. Plus I got to do what I love - talk about IASSIST and invite people to attend or present at our own conference.

While there may not be striking overlap between the IASSIST and OR conferences, I think there are sound reasons to keep building linkages between the two. IASSISTers could certainly provide beneficial insight on various RDM questions and also, for instance, on researchers' needs, scholarly communication, reusing repository content, research data resources and access, or data archiving and preservation challenges. We could take advantage of the passion and dedication the repository community shows in making repositories and their building blocks perfect. It's quite clear that there is a lot more to be achieved when repository developers and users meet and address problems and opportunities with creativity and commitment.

 

While IASSIST2015 had a plenary speaker from Facebook, OR had keynote speakers from Mozilla Science Lab and Google Scholar. Mozilla's Kaitlin Thaney skyped in a very interesting opening keynote (that is what you resort to when thunderstorms prevent your keynote speaker from arriving!) on how to leverage the power of the web for research. A distributed and collaborative approach to research, public sharing and transparency, new models of discovery, freedom to innovate and prototype, and peer-to-peer professional development were among the powers of web-enabled open science.
 
Anurag Acharya from Google gave a stimulating talk on pitfalls and best practices in indexing repositories. His points were primarily aimed at repository managers fine-tuning their repository platforms to be as easily harvestable as possible. However, many of his remarks are worth taking into account when building data portals or data-rich web services. On the other hand, it can be asked whether it is our job (as repository or data managers) to make things easy for Google Scholar, or whether we have other obligations that put our needs and our users first. Often the two are not in conflict, though. More notable from my point of view was Acharya's statement that Google Scholar does not index research outputs other than articles (data, appendixes, abstracts, code…) from the repositories. But should it not? His answer was that it would be lovely, but it cannot be done efficiently because these resources are not comprehensive enough, and it would not be possible, for example, to properly and accurately link users to actual datasets from the index. I'd like to think this is something for the IASSIST community to contemplate.

Open Researcher and Contributor ID (ORCID) had a very strong presence at OR2015. ORCID provides an open persistent identifier that distinguishes a researcher from every other researcher, and through its APIs that ID can be connected to organisational and inter-organisational research information systems, helping to associate researchers with their research activities. In addition to a workshop on the ORCID APIs there were many presentations about ORCID integrations. It seems that ORCID is getting close to reaching a critical mass of users and members, allowing it to take big leaps in developing its services. However, it still remains to be seen how widely it will be adopted. For research data archiving purposes, having a persistent identifier provides obvious advantages, as researchers are known to move from one organisation to another, work cross-nationally, and collaborate across disciplines.
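
To make the integration point concrete, the sketch below queries ORCID's public API for a single record, assuming Python with the requests library. The endpoint is ORCID's documented public v3.0 API and the identifier is ORCID's frequently cited example record; the exact JSON field names follow ORCID's schema and should be checked against the current documentation.

```python
import requests

# ORCID's public API; 0000-0002-1825-0097 is ORCID's example record.
ORCID_ID = "0000-0002-1825-0097"
url = f"https://pub.orcid.org/v3.0/{ORCID_ID}/record"

resp = requests.get(url, headers={"Accept": "application/json"})
resp.raise_for_status()
record = resp.json()

# Print the researcher's name; a real integration would instead map works,
# affiliations, and funding entries into the local research information system.
name = record["person"]["name"]
print(name["given-names"]["value"], name["family-name"]["value"])
```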

Many presentations at least partly addressed familiar but ever challenging research data service questions on deposits, providing data services for the researcher community and overcoming ethical, legal or institutional barriers, or providing and managing a trustworthy digital service with somewhat limited resources. Check for example Andrew Gordon's terrific presentation on Databrary, a research-centered repository for video data. Metadata harmonisation, ontologies, putting emphasis on high quality metadata and ensuring repurposing of metadata were among the common topics as well, alongside a focus on complying with standards - both metadata and technical.

I see a good opportunity and considerable common ground for shared learning here, for example for DDI and other metadata experts to work with repository developers, and for IASSIST's data librarians and archivists to provide training and take part in projects that concentrate on repository development in libraries or archives.

Keynotes and a number of other sessions were live-streamed and recorded for later viewing. Videos of the keynotes and some other talks, and most presentation slides, are already available; the rest of the videos will follow in the coming weeks.

A decade against decay: the 10th International Digital Curation Conference

The International Digital Curation Conference (IDCC) is now ten years old. On the evidence of its most recent conference, it is in rude health and growing fast.

IDCC marked the first time IASSIST decided to formally support another organisation's conference. I think it was a wise investment given the quality of the plenaries, presentations, posters, and discussions.

The DCC has already made available a number of blog posts covering the substance of the sessions, including an excellent summary by IASSIST web editor Robin Rice. Presentations and posters are already available, and video from the plenary sessions will soon be online.

Instead, I will use this opportunity to pick up on outstanding issues and suggestions for future conferences.

One was the apportionment of responsibility. Ultimately, researchers are responsible for the management of their data, but they can only manage it well if supporting infrastructure is in place to help them. So who is responsible for providing that: funders or institutions? This theme emerged in the context of the UK’s Engineering and Physical Sciences Research Council, which will soon enforce expectations identifying the institution as responsible for supporting good Research Data Management.

Related to that was a discussion of the role of libraries in this decade. Are they relevant? Can they change to meet new challenges? Having started out as a researcher who became a data archivist and is now a librarian, I wouldn’t be here if libraries weren’t meeting these challenges. There’s a “hush” of IASSIST members also ready to take issue with the suggestion that libraries aren’t relevant or engaged with data; in fact, they did so at our last conference.

Melissa Terras (UCL) did a fantastic job presenting [PDF] work in the digital humanities that is innovative in not only preserving but also rescuing objects - and all done on small-change research budgets. I hope a future IDCC finds space for a social sciences person to present on the issues we face in preservation and reuse. Clifford Lynch (CNI) touched on the problems of data reuse and human subjects, which remained one of the few glancing references to a significant problem, and one IASSIST members are addressing. Indeed, thanks must go to a former president of this association, Peter Burnhill (Edinburgh), who mentioned IASSIST and how it relates to the IDCC audience on more than one occasion.

Finally, if you were stimulated by IDCC’s talk of data, reuse, and preservation then don’t forget our own conference in Minneapolis later this year.

Feedback on Data Storage

I posted the following question to the listserv:

"I'm in the early days of exploring what I and our library can do for our faculty and grad students. In my case I'm particularity interested in the social sciences.

It seems there are three main choices:

1. ICPSR (or other domain-specific site)

2. Dataverse with my own school's branding

3. Local, campus-funded storage through an Institutional Repository or something else that can handle larger amounts of data.


Our university is kind of in the vast middle as far as flagship state universities go in budgets and research activity.

What are the pros and cons of these archiving choices? What would best suit a non-wealthy institution? Which requires more training and expertise?"

From the very informative feedback I received from my IASSIST colleagues, I concluded that it is best to stay open to all kinds of possibilities. I was probably naïve in my initial hope that there would be one solution on which I could train my energies. However, that is not the case. Different solutions may be best depending on several factors, including the data in question, local staff skills, and library budgets.

There were many voices that supported the domain-specific repository idea represented by ICPSR. Researchers can get exposure to colleagues in their areas of expertise. There is no need to reinvent the wheel if the expertise and the longevity that ICPSR can provide are out there. In addition, ICPSR is launching “openICPSR,” a new open access repository for researchers and institutions that need to comply with Federal requirements to make data publicly available.  Data deposited in "openICPSR" will be discoverable in the ICPSR catalog, but not restricted to ICPSR members -- anyone will be able to download.  ICPSR staff will edit the metadata appearing in the catalog, and depositors can commission full curation of their collections (e.g. full codebooks, variable-level metadata for searching) by ICPSR staff. In addition to accepting individual projects, openICPSR will also offer packages to meet institutional needs.  They are planning at least two options: 1) A multiple deposit option whereby an entity can purchase several project deposits (fees will be discounted for member institutions), and 2) A branded repository page that will list datasets under an institution's own logo and color scheme.

Many others outlined the Dataverse picture. If you can get a good match between what your campus needs and what Dataverse can provide, this can be a crucial part of an overall solution.  Dataverse has ease of entry through a self-service deposit structure, not to mention that the price is right (free)! Many institutions are starting with pilot projects in order to assess the labor impact on the library. A few librarians noted that there are issues of long-term storage, sustainability, and metadata uniformity that can arise with Dataverse.

Some respondents hastened to add that Dataverse will be offering improved services. Dataverse is extending support for additional metadata standards in various scientific domains, including biomedical ontologies and astronomy, and is updating to DDI Codebook 2.5 (in the future, with support for DDI Lifecycle). They are also extending search, data exploration, and analysis for tabular datasets (with histograms, cross-tabs, enhanced descriptive stats, and model selection), as well as the Data/Metadata API, the data deposit API, and rich ingest for additional data types.
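
For readers who have not worked with these APIs, here is a hedged sketch of retrieving a dataset's metadata through Dataverse's native JSON API, in Python with the requests library. The server address, DOI, and token are placeholders, and the exact response layout can vary across Dataverse versions.

```python
import requests

BASE = "https://demo.dataverse.org"      # placeholder Dataverse server
DOI = "doi:10.70122/FK2/EXAMPLE"         # hypothetical dataset DOI
API_TOKEN = "xxxxxxxx"                   # placeholder; needed for drafts/restricted data

# Native API: look up a dataset by its persistent identifier.
resp = requests.get(
    f"{BASE}/api/datasets/:persistentId/",
    params={"persistentId": DOI},
    headers={"X-Dataverse-key": API_TOKEN},
)
resp.raise_for_status()

# The citation block carries title, authors, description, and other core metadata.
citation = resp.json()["data"]["latestVersion"]["metadataBlocks"]["citation"]
for field in citation["fields"]:
    print(field["typeName"])
```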

Local solutions, including formal Institutional Repositories (IRs) and other storage services offered through a variety of campus resources, did not emerge as a popular topic in the posts I received. One librarian commented on the resources, in personnel and money, that may be needed for IRs to deliver strong service for larger deposits.

Steve McGinty

Social Sciences Librarian

University of Massachusetts - Amherst

White Paper Urges New Approaches to Assure Access to Scientific Data

Press release posted on behalf of Mark Thompson-Kolar, ICPSR.

12/12/2013: (Ann Arbor, MI)—More than two dozen data repositories serving the social, natural, and physical sciences today released a white paper recommending new approaches to funding the sharing and preservation of scientific data. The document emphasizes the need for sustainable funding of domain repositories—data archives with ties to specific scientific communities.

“Sustaining Domain Repositories for Digital Data: A White Paper” is an outcome of a meeting convened June 24-25, 2013, in Ann Arbor. The meeting, organized by the Inter-university Consortium for Political and Social Research (ICPSR) and supported by the Alfred P. Sloan Foundation, was attended by representatives of 22 data repositories from a wide spectrum of scientific disciplines.

Domain repositories accelerate intellectual discovery by facilitating data reuse and reproducibility. They leverage in-depth subject knowledge as well as expertise in data curation to make data accessible and meaningful to specific scientific communities. However, domain repositories face an uncertain financial future in the United States, as funding remains unpredictable and inadequate. Unlike its European competitors, which support data archiving as necessary scientific infrastructure, the US does not assure the long-term viability of data archives.

“This white paper aims to start a conversation with funding agencies about how secure and sustainable funding can be provided for domain repositories,” said ICPSR Director George Alter. “We’re suggesting ways that modifications in US funding agencies’ policies can help domain repositories to achieve their mission.”

Five recommendations are offered to encourage data stewardship and support sustainable repositories: 

  • Commit to sustaining institutions that assure the long-term preservation and viability of research data
  • Promote cooperation among funding agencies, universities, domain repositories, journals, and other stakeholders
  • Support the human and organizational infrastructure for data stewardship as well as the hardware
  • Establish review criteria appropriate for data repositories
  • Incentivize Principal Investigators (PIs) to archive data

While a single funding model may not fit all disciplines, new approaches are urgently needed, the paper says.

“What’s really remarkable about this effort—the meeting and the resulting white paper—has been the consensus across disciplines from astronomy to archaeology to proteomics,” Alter said. “More than two dozen domain repositories from so many disciplines are saying the same thing: Data sharing can produce more science, but data stewards must know the needs of their scientific communities.”

This white paper is a must-read for anyone who wants to understand scientific domain repositories and their critical role in the advancement of science. It can be downloaded at http://datacommunity.icpsr.umich.edu

 

The Inter-university Consortium for Political and Social Research (ICPSR), based in Ann Arbor, MI, is the largest archive of behavioral and social science research data in the world. It advances research by acquiring, curating, preserving, and distributing original research data. www.icpsr.umich.edu

The Alfred P. Sloan Foundation is a philanthropic, not-for-profit grantmaking institution based in New York City. Established in 1934, the Foundation makes grants in support of original research and education in science, technology, engineering, mathematics, and economic performance. www.sloan.org

###

re3data.org and OpenAIRE sign MoU during Open Access Week; new re3data.org features

Last month, OpenAIRE (Open Access Infrastructure for Research in Europe) and re3data.org signed a Memorandum of Understanding to “work jointly to facilitate research data registration, discovery, access and re-use” in support of open science. OpenAIRE is an infrastructure for open access that works to track and measure research output (originally designed to monitor EU funding activities). re3data.org is an online registry of research data repositories.

re3data.org and OpenAIRE will exchange metadata in order for OpenAIRE to “integrate data repositories indexed in the re3data.org registry and in turn return information about usage statistics for datasets and inferred links between data and publications.”

For more information, see the OpenAIRE press release on the MoU.

In addition, re3data.org is now mentioned in the deposition policy of Nature's Scientific Data, which encourages the registration of repositories with the service, and re3data.org has announced a collaboration with BioSharing.

re3data.org has also made other recent enhancements. Users can now browse re3data.org repositories by:

  1. subject
  2. content type
  3. country

Furthermore, a redesigned repository record now groups information into the categories of general, institutions, terms, and standards. Many more repositories have been added in the past few months, so check it out!

The Role of Data Repositories in Reproducible Research

Cross-posted from the ISPS Lux et Data Blog

Who is responsible for the quality of research data, and what does that responsibility entail? These questions were on my mind as I was preparing to present a poster at the Open Repositories 2013 conference in Charlottetown, PEI earlier this month. The annual conference brings the digital repositories community together with stakeholders, such as researchers, librarians, publishers and others, to address issues pertaining to “the entire lifecycle of information.” The conference theme this year, “Use, Reuse, Reproduce,” could not have been more relevant to the ISPS Data Archive. Two plenary sessions bookended the conference, both discussing the credibility crisis in science. In the opening session, Victoria Stodden set the stage with her talk about the central role of algorithms and code in the reproducibility and credibility of science. In the closing session, Jean-Claude Guédon made a compelling case that open repositories are vital to restoring quality in science.

My poster, titled, “The Repository as Data (Re) User: Hand Curating for Replication,” illustrated the various data quality checks we undertake at the ISPS Data Archive. The ISPS Data Archive is a small archive, for a small and specialized community of researchers, containing mostly small data. We made a key decision early on to make it a "replication archive," by which we mean a repository that holds data and code for the purpose of being used to replicate and verify published results.

The poster presents ISPS Data Archive’s answer to the questions of who is responsible for the quality of data and what that means: We think that repositories do have a responsibility to examine the data and code we receive for deposit before making the files public, and that this data review involves verifying and replicating the original research outputs. In practice, this means running the code against the data to validate published results. These steps in effect expand the role of the repository and more closely integrate it into the research process, with implications for resources, expertise, and relationships, which I will explain here.
First, a word about what data repositories usually do, the special obligations reproducibility imposes, and who is fulfilling them now. This ties in with a discussion of data quality, data review, and the role of repositories.

Data Curation and Data Quality

A well-curated data repository is more than a place to put data. The Digital Curation Centre (DCC) explains that data curation means ensuring data are accessible to designated users for first-time use and reuse. This involves a set of curatorial practices – maintaining, preserving and adding value to digital research data throughout its lifecycle – which reduces threats to the long-term research value of the data, minimizes the risk of its obsolescence, and enables sharing and further research. An example of a standard-setting curation process is that of the Inter-university Consortium for Political and Social Research (ICPSR). This process involves organizing, describing, cleaning, enhancing, and preserving data for public use and includes format conversions, reviewing the data for confidentiality issues, creating documentation and metadata records, and assigning digital object identifiers. Similar data curation activities take place at many data repositories and archives.

These activities are understood as essential for ensuring and enhancing data quality. Dryad, for example, states that its curatorial team “works to enforce quality control on existing content.” But there are many ways to assess the quality of data. One criterion is verity: whether the data reflect actual facts, responses, observations or events. This is often assessed by the existence and completeness of metadata. The UK’s Economic and Social Research Council (ESRC), for example, requests documentation of “the calibration of instruments, the collection of duplicate samples, data entry methods, data entry validation techniques, methods of transcription.” Another way to assess data quality is by its degree of openness. Shannon Bohle recently listed no fewer than eight different standards for assessing the quality of open data on this dimension. Others argue that data quality consists of a mix of technical and content criteria that all need to be taken into account. Wang & Strong’s 1996 article claims that “high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer.” More recently, Kevin Ashley observed that quality standards may be at odds with each other. For example, some users may prize the completeness of the data while others prize its timeliness. These standards can go a long way toward ensuring that data are accurate, complete, and timely and that they are delivered in a way that maximizes their use and reuse.

Yet these procedures are “rather formal and do not guarantee the validity of the content of the dataset” (Doorn et al.). Leaving aside the question of whether they are always adhered to, these quality standards are insufficient when viewed through the lens of “really reproducible research.” Reproducible science requires that data and code be made available alongside the results, to allow regeneration of the published results. For a replication archive, such as the ISPS Data Archive, the reproducibility standard is imperative.

Data Review

The imperative to provide data and code, however, only achieves the potential for verification of published results. It remains unclear how actual replication is to occur. That’s where a comprehensive definition of the concept of “data review” can be useful: at ISPS, we understand data review to mean taking that extra step – examining the data and code received for deposit and verifying and replicating the original research outputs.
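
To make “verifying and replicating the original research outputs” concrete, here is a minimal sketch of what such a check can look like, assuming a deposit that contains a Python analysis script and numeric results tables as CSV files; the file layout, names, and tolerance are invented for illustration, and real deposits are far more varied.

```python
import subprocess
from pathlib import Path

import pandas as pd

def verify_deposit(deposit_dir: str, tolerance: float = 1e-6) -> bool:
    """Re-run a deposited analysis and compare its output
    to the results table reported in the publication."""
    deposit = Path(deposit_dir)

    # Re-run the author's analysis script against the deposited data.
    subprocess.run(["python", "analysis.py"], cwd=deposit, check=True)

    produced = pd.read_csv(deposit / "output" / "table1.csv")
    published = pd.read_csv(deposit / "published" / "table1.csv")

    # The regenerated numbers should match the published ones within tolerance.
    return ((produced - published).abs() <= tolerance).to_numpy().all()

if __name__ == "__main__":
    print("replicates:", verify_deposit("deposits/study_42"))  # invented path
```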

In a recent talk, Christine Borgman pointed out that most repositories and archives follow the letter, not the spirit, of the law. They take steps to share data, but they do not review the data. “Who certifies the data? Gives it some sort of imprimatur?” she asks. This theme resonated at Open Repositories. Stodden asked: “Who, if anyone, checks replication pre-publication?” Chuck Humphrey lamented the lack of an adequate data curation toolkit and best practices regarding the extent of data processing prior to ingest. And Guédon argued that repositories have a key role to play in bringing quality to the foreground in the management of science.

Stodden’s call for the provision of data and code underlying publication echoes Gary King’s 1995 definition of the “replication standard” as the provision of “sufficient information… with which to understand, evaluate, and build upon a prior work if a third party could replicate the results without any additional information from the author.” Both call on the scientific community to take up replication for the good of science as a matter of course in their scientific work. However, both are vague as to how this can be accomplished. Stodden suggested at Open Repositories that this activity is community-dependent, often done by students or by other researchers continuing a project, and that community norms can be adjusted by rewarding high-integrity, verifiable research. King, on the other hand, argues that “the replication standard does not actually require anyone to replicate the results of an article or book. It only requires sufficient information to be provided – in the article or book or in some other publicly accessible form – so that the results could in principle be replicated” (emphasis added). Yet, if we care about data quality, reproducibility, and credibility, it seems to me that this is exactly the kind of review in which we should be engaging.

A quick survey of various stakeholders in the research data lifecycle reveals that data review of this sort is not widely practiced:

  • Researchers, on the whole, do not do replication tests as part of their own work, or even as part of the peer review process. In the future, there may be incentives for researchers to do so, and post-publication crowd-sourced peer review in the mold of Wikipedia, as promoted by Edward Curry, may prove to be a successful model.
  • Academic institutions, and their libraries, are increasingly involved in the data management process, but are not involved in replication as a matter of course (note some calls for libraries to take a more active role in this regard).
  • Large or general data repositories like Dryad, FigShare, Dataverse, and ICPSR provide useful guidelines and support varying degrees of file inspection, as well as making it significantly easier to include materials alongside the data, but they do not replicate analyses for the purpose of validating published results. Efforts to encourage compliance with (some of) these standards (e.g., the Data Seal of Approval) typically regard researchers as responsible for data quality, and generally leave repositories to self-regulate.
  • Innovative services, such as RunMyCode, offer a dissemination platform for the necessary pieces required to submit the research to scrutiny by fellow scientists, allowing researchers, editors, and referees to “replicate scientific results and to demonstrate their robustness.” RunMyCode is an excellent facilitator for people who wish to have their data and code validated; but it relies on crowd sourcing, and does not provide the service per se.
  • Some argue that scholarly journals should take an active role in data review, but this view is controversial. A document produced by the British Library recently recommended that, “publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers.” In some disciplines, reviewers do check the data. The F1000 group identifies the “complexity of the relationship between the data/article peer review conducted by our journal and the varying levels of data curation conducted by different data repositories.” The group provides detailed guidelines for authors on what is expected of them to submit and ensures that everything is submitted and all checklists are completed. It is not clear, however, if they themselves review the data to make sure it replicates results. Alan Dafoe, a political scientist at Yale, calls for better replication practices in political science. He places responsibility on authors to provide quality replication files, but then also suggests that journals encourage high standards for replication files and that they conduct a “replication audit” which will “evaluate the replicability and robustness of a random subset of publications from the journal.”

The ISPS Data Archive and Reproducible Research

This brings us to the ISPS Data Archive. As a small, on-the-ground, specialized data repository, we are dedicated to serious data review. All data and code – as well as all accompanying files – that are made public via the Archive are closely reviewed and adhere to standards of quality that include verity, openness, and replication. In practice, this means that we have developed curatorial practices that include assessing whether the files underlying a published (or soon-to-be-published) article, as provided by the researchers, actually reproduce the published results.

This requires significant investment in staffing, relationships, and resources. The ISPS Data Archive staff has data management and archival skills, as well as domain and statistical expertise. We invest in relationships with researchers and learn about their research interests and methods to facilitate communication and trust. All this requires the right combination of domain, technical and interpersonal skills as well as more time, which translates into higher costs.

How do we justify this investment? Broadly speaking, we believe that stewardship of data in the context of “really reproducible research” dictates this type of data review. More specifically, we think this approach provides better quality, better science, and better service.

  • Better quality. By reviewing all data and code files and validating the published results, the ISPS Data Archive essentially certifies that all its research outputs are held to a high standard. Users are assured that code and data underlying publications are valid, accessible, and usable.
  • Better science. Organizing data around publications advances science because it helps root out error. “Without access to the data and computer code that underlie scientific discoveries, published findings are all but impossible to verify” (Stodden et al.) Joining the publication to the data and code combats the disaggregation of information in science associated with open access to data and to publications on the Web. In effect, the data review process is a first order data reuse case: The use of research data for research activity or purpose other than that for which it was intended. This places the Archive as an active partner in the scientific process as it performs a sort of “internal validity” check on the data and analysis (i.e., do these data and this code actually produce these results?).

    It’s important to note that the ISPS Data Archive is not reviewing or assessing the quality of the research itself. It is not engaged in questions such as, was this the right analysis for this research question? Are there better data? Did the researchers correctly interpret the results? We consider this aspect of data review to be an “external validity” check and one which the Archive staff is not in a position to assess. This we leave to the scientific community and to peer review. Our focus is on verifying the results by replicating the analysis and on making the data and code usable and useful.

  • Better service. The ISPS Data Archive provides high level, boutique service to our researchers. We can think of a continuum of data curation that progresses from a basic level where data are accepted “as is” for the purpose of storage and discovery, to a higher level of curation which includes processing for preservation, improved usability, and compliance, to an even higher level of curation which also undertakes the verification of published results.

This model may not be applicable to other contexts. A larger lab, greater volume of research, or simply more data will require greater resources and may prove this level of curation untenable. Further, the reproducibility imperative does not neatly apply to more generalized data, or to data that is not tied to publications. Such data would be handled somewhat differently, possibly with less labor-intensive processes. ISPS will need to consider accommodating such scenarios and the trade-offs a more flexible approach no doubt involves.

For those of us who care about research data sharing and preservation, the recent interest in the idea of a “data review” is a very good sign. We are a long way from having all the policies, technologies, and long-term models figured out. But a conversation about reviewing the data we put in repositories is a sign of maturity in the scholarly community – a recognition that simply sharing data is necessary, but not sufficient, when held up to the standards of reproducible research.

OR2013: Open Repositories Confront Research Data

Open Repositories 2013 was hosted by the University of Prince Edward Island from July 8-12. A strong research data stream ran throughout this conference, which was attended by over 300 participants from around the globe.  To my delight, many IASSISTers were in attendance, including the current IASSIST President and four Past-Presidents!  Rarely do such sightings happen outside an IASSIST conference.

This was my first Open Repositories conference and after the cool reception that research data received at the SPARC IR meetings in Baltimore a few years ago, I was unsure how data would be treated at this conference.  I was pleasantly surprised by the enthusiastic interest of this community toward research data.  It helped that there were many IASSISTers present but the interest in research data was beyond that of just our community.  This conference truly found an appropriate intersection between the communities of social science data and open repositories. 

Thanks go to Robin Rice (IASSIST), Angus Whyte (DCC), and Kathleen Shearer (COAR) for organizing a workshop entitled, “Institutional Repositories Dealing with Data: What a difference a ‘D’ makes!”  Michael Witt, Courtney Matthews, and I joined these three organizers to address a range of issues that research data pose for those operating repositories.  The registration for this workshop was capped at 40 because of our desire to host six discussion tables of approximately seven participants each.  The workshop was fully subscribed and Kathleen counted over 50 participants prior to the coffee break.  The number clearly expresses the wider interest in research data at OR2013.

Our workshop helped set the stage for other sessions during the week.  For example, we talked about environmental drivers popularizing interest in research data, including topics around academic integrity.  Regarding this specific issue, we noted that the focus is typically directed toward specific publication-related datasets and the access needed to support the reproducibility of published research findings.  Both the opening and closing plenary speakers addressed aspects of academic integrity and the role of repositories in supporting the reproducibility of research findings.  Victoria Stodden, the opening plenary speaker, presented a compelling and articulate case for access to both the data and computer code upon which published findings are based.  She calls herself a computational scientist and defends the need to preserve computer code as well as data to facilitate the reproducibility of scientific findings.  Jean-Claude Guédon, the closing plenary speaker, bracketed this discussion on academic integrity.  He spoke about scholarly publishing and how the commercial drive toward indicators of excellence has resulted in cheating.  He likened some academics to Lance Armstrong, cheating to become number one.  He feels that quality rather than excellence is a better indicator of scientific success.

Between these two stimulating plenary speakers, there were a number of sessions during which research data were discussed.  I was particularly interested in a panel of six entitled “Research Data and Repositories,” especially because the speakers were from the repository community instead of the data community.  They each took turns responding to questions about what their repositories do now regarding research data and what they see happening in the future.  In a nutshell, their answers tended to describe the desire to make better connections between the publications in their repositories and the data underpinning the findings in these articles.  They also spoke about the need to support more stages of the research lifecycle, which often involves aspects of the data lifecycle within research.  There were also statements that reinforced the need for our (IASSIST’s) continued interaction with the repository community.  The use of readme files in the absence of standards-based metadata, and other practices where our data community has moved the best-practice yardstick well beyond, demonstrates the need for our communities to continue in dialogue.

Chuck Humphrey

In search of: Best practice for code repositories?

I was asked by a colleague about organized efforts within the economics community to develop or support repositories of code for research.  Her experience was with the astrophysics world, which apparently has several, and she was wondering what could be learned from another academic community.  So I asked a non-random sample of technical economists with whom I work, and then expanded the question to cover all of the social sciences by posing it to the IASSIST community.

In a nutshell, the answer seems to be “nope, nothing organized across the profession” – even with the profession very broadly defined.  The general consensus for both the economics world and the more general social science community was that there was some chaos mixed with a little schizophrenia. I was told there are instances of such repositories, but they were described to me as “isolated attempts,” such as this one by Volker Wieland:  http://www.macromodelbase.com/.  Some folks mentioned repositories that were package- or language-based, such as R modules or SAS code from the SAS-L list or online at sascommunity.org.

Many people pointed out that more repositories are being associated with journals, so that authors can (or are required to) submit their data and code when submitting a paper for publication. Several responses touched on this issue of replication, which is the impetus for most journal requirements, including one that pointed out a “replication archive” at Yale (http://isps.yale.edu/research/data).  I was also pointed to an interesting paper that questions whether such archives promote replicable research (http://www.pages.drexel.edu/~bdm25/cje.pdf), but that’s a discussion for another post.

By far, the most common reference I received was for the repositories associated with RePEc (Research Papers in Economics) which offers a broad range of services to the economic research community.  There you’ll find the IDEAS site (http://ideas.repec.org/) and the QM&RBC site with code for Dynamic General Equilibrium models (http://dge.repec.org/) both run by the St. Louis Fed.

I also heard from support folks who had tried to build a code repository for their departments and were disappointed by the lack of enthusiasm for the project. The general consensus is that economists would love to leverage other people’s code but don’t want to give away their proprietary models.  They should know there is no such thing as a free lunch! 

I did hear that project-specific repositories were found to be useful, but I think of those as collaboration tools rather than dissemination platforms.  That said, one economist did end his email to me with the following plea:  “lots of authors provide code on their websites, but there is no authoritative host. Will you start one please?”

/san/

Data-related blog posts coming out of Open Repositories 2012 conference

I'd been meaning to write an IASSIST blog post about OR2012, hosted by the University of Edinburgh's Host Organising Committee led by Co-Chair and IASSISTer Stuart Macdonald in July, because it had such good DATA content.

Fortunately Simon Hodson, the UK's JISC Managing Research Data Programme Manager, has provided this introduction and has allowed me to post it here, with further links to his analytic blog posts, and even those contain further links to OTHER blog posts talking about OR2012 and data!

There are also more relevant pointers from the OR 2012 home page here: http://or2012.ed.ac.uk/2012/08/20/another-round-of-highlights/

I think there's enough here to easily keep people going until next year's conference in Prince Edward Island in July. Oh, and Peter Burnhill, Past President of IASSIST, made a good plug for IASSIST in his closing keynote, pointing it out to repository professionals as a source of expertise and community for would-be data professionals.

Enjoy! - Robin Rice, University of Edinburgh

---Forward----

It has been widely remarked that OR2012 saw the arrival of research data in the repository world.  Using a wordle of #or2012 tweets in his closing summary, Peter Burnhill noted that ‘Data is the big arrival. There is a sense in which data is now mainstream.’  (See Peter’s summary on the OR2012 YouTube channel: http://www.youtube.com/watch?v=0jQRDWq-dhc&feature=plcp).

I have written a series of blog posts reflecting on the contributions made by *some* of those working on research data repositories, and particularly on the development of research data services: http://or2012.ed.ac.uk/2012/08/20/another-round-of-highlights/.

These posts may be of interest to subscribers to this list and are listed below.

Institutional Data Repositories and the Curation Hierarchy: reflections on the DCC-ICPSR workshop at OR2012 and the Royal Society’s Science as an Open Enterprise report
http://researchdata.jiscinvolve.org/wp/2012/08/06/institutional-data-repositories-and-the-curation-hierarchy-reflections-on-the-dcc-icpsr-workshop-at-or2012-and-the-royal-societys-science-as-an-open-enterprise-report/

‘Data is now Mainstream’: Research Data Projects at OR2012 (Part 1…)
http://researchdata.jiscinvolve.org/wp/2012/08/13/data-is-now-mainstream-research-data-projects-at-or2012-part-1/

Pulling it all Together: Research Data Projects at OR2012 (Part 2…)
http://researchdata.jiscinvolve.org/wp/2012/08/14/pulling-it-all-together-research-data-projects-at-or2012-part-2/

Making the most of institutional data assets: Research Data Projects at OR2012 (Part 3…)
http://researchdata.jiscinvolve.org/wp/2012/08/15/making-the-most-of-institutional-data-assets-research-data-projects-at-or2012-part-3/

Manage locally, discover (inter-)nationally: research data management lessons from Australia at OR2012
http://researchdata.jiscinvolve.org/wp/2012/08/16/manage-locally-discover-inter-nationally-research-data-management-lessons-from-australia-at-or2012/

Simon Hodson [reposted with permission]
