iBlog

The Practice of User Registration to Access Data

I am writing to address an issue that is coming up at UCLA with increasing frequency. More and more data distributors (agencies, archives, individuals) require a user to "register" before downloading data from a web site. These registrations take many forms, and there is a continuum in the level of detail and responsibility assigned to the user. It is hard for me to be of service to users when each study requires every individual to perform the registration task. Few archives will agree to any kind of blanket license under which the Archive can download the data once and provide it to users at UCLA. Fewer still will permit me to act as a gatekeeper, as ICPSR does for really secure data. I'd like to know how other archives handle these issues.

Faculty chafe at the registration requirement, and I am aware of numerous cases where one faculty member gets the file(s) and "shares" them, regardless of the agreement made with the data distributor not to do so. There is also a feeling on the part of faculty that increasing access barriers add to the cost of research. I am intrigued by this idea and wonder whether anyone else has heard this argument. I would like to find data that can demonstrate this increase in cost, whether in time, staff costs, hardware or software costs, or other quantifiable aspects of research. Any suggestions would be welcome.

Contributed by Libbie Stephenson

Comments

An interesting piece related

An interesting piece related to access, but from a different perspective than we usually deal with. Not sure if the URL will work, so the text (without formatting and graphics) is pasted below. http://www.nature.com/nature/journal/v439/n7072/full/439006a.html

Nature 439, 6-7 (5 January 2006) | doi:10.1038/439006a

Mashups mix data into global service
Declan Butler

Is this the future for scientific analysis? Will 2006 be the year of the mashup? Originally used to describe the mixing together of musical tracks, the term now refers to websites that weave data from different sources into a new service. They are becoming increasingly popular, especially for plotting data on maps, covering anything from cafés offering wireless Internet access to traffic conditions. And advocates say they could fundamentally change many areas of science — if researchers can be persuaded to share their data.

Some disciplines already have software that allows data from different sources to be combined seamlessly. For example, a bioinformatician can get a gene sequence from the GenBank database, its homologues using the BLAST alignment service, and the resulting protein structures from the Swiss-Model site in one step. And an astronomer can automatically collate all available data for an object, taken by different telescopes at various wavelengths, into one place, rather than having to check each source individually.

So far, only researchers with advanced programming skills, working in fields organized enough to have data online and tagged appropriately, have been able to do this. But simpler computer languages and tools are helping. Google's maps database, for example, allows users to integrate data into it using just ten lines of code (http://www.google.com/apis/maps). UniProt, the world's largest protein database, is developing its existing public interfaces to protein sequence data to encourage outside users to access and reuse its data.

The biodiversity community is one group working to develop such services. To demonstrate the principle, Roderic Page of the University of Glasgow, UK, built what he describes as a "toy" — a mashup called Ispecies.org (http://darwin.zoology.gla.ac.uk/~rpage/ispecies). If you type in a species name, it builds a web page for it showing sequence data from GenBank, literature from Google Scholar and photos from a Yahoo image search. If you could pool data from every museum or lab in the world, "you could do amazing things", says Page.

[Image: "Web crawling: ant researchers are bringing together information from a variety of sources." Credit: M. DOHRN/K. TAYLOR/NATUREPL/NASA]

Donat Agosti of the Natural History Museum in Bern, Switzerland, is working on this. He is one of the driving forces behind AntBase and AntWeb, which bring together data on some 12,000 ant species. He hopes that, as well as searching, people will reuse the data to create phylogenetic trees or models of geographic distribution. This would provide the means for a real-time, worldwide collaboration of systematicists, says Norman Johnson, an entomologist at Ohio State University in Columbus. "It has the potential to fundamentally change and improve the way that basic systematic research is conducted." A major limiting factor is the availability of data in formats that computers can manipulate.
To develop AntWeb further, Agosti aims to convert 4,000 papers into machine-readable online descriptions. Another problem is the reluctance of many labs and agencies to share data. But this is changing. A spokesman for the Global Health Atlas from the World Health Organization (WHO), for example, a huge infectious-disease database, says there are plans to make access easier. The Global Biodiversity Information Facility (GBIF) has linked up more than 80 million records in nearly 600 databases in 31 countries. And last month saw the launch of the International Neuroinformatics Coordinating Facility.

But such initiatives are hampered by restrictive data-access agreements. The museums and labs that provide the GBIF with data, for example, often require outside researchers to sign online agreements to download individual data sets, making real-time computing of data from multiple sources almost impossible.

Nature has created its own mashup, which integrates data on avian-flu outbreaks from the WHO and the UN Food and Agriculture Organization into Google Earth (http://www.nature.com/nature/googleearth/avianflu1.kml). The result is a useful snapshot, but illustrates the problem. As the data are not in public databases that can be directly accessed by software, we had to request them from the relevant agencies, construct a database and compute them into Google Earth. If the data were available in a machine-readable format, the mashup could search the databases automatically and update the maps as outbreaks occur. Other researchers could also mix the data with their own data sets. Page and Agosti hope that researchers will soon become more enthusiastic about sharing. "Once scientists see the value of freeing-up data, mashups will …"
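For the curious, the weaving the article describes boils down to a couple of HTTP requests and a merge. Here is a rough sketch in Python; the endpoints and field names are invented placeholders, not the real GenBank or Yahoo APIs:

```python
import json
import urllib.parse
import urllib.request

def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def species_mashup(name):
    """Combine records about one species from two (hypothetical) services.

    The endpoints below are placeholders standing in for real services
    such as GenBank or an image search API.
    """
    query = urllib.parse.quote(name)
    sequences = fetch_json(f"https://sequences.example.org/search?q={query}")
    images = fetch_json(f"https://images.example.org/search?q={query}")
    # The "mashup" step: weave both sources into a single record.
    return {
        "species": name,
        "sequences": sequences.get("results", []),
        "images": images.get("results", []),
    }

if __name__ == "__main__":
    print(json.dumps(species_mashup("Formica rufa"), indent=2))
```

The hard part, as the article makes clear, is not this step but persuading each source to expose machine-readable data in the first place.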

Many thanks again to everyone

Many thanks again to everyone who has written to me about user registration. This is a follow-up on what you all sent, and a chance to reflect a bit on the comments from Chuck Humphrey, Ernie Boyko and Wendy Watkins. These are fantastic comments! I have been using what people sent me to work with my campus site license administrators in the library and computer center. It turns out that licensing procedures for computing centers are quite different from those for libraries, and with data files one has to do a little of both. The negotiations about access, as well as about long-term holdings and preservation, are varied to say the least. Every vendor has its own take on the number of people who can access something at the same time, whether there can be on-site backups, and how long one can keep the old material. The size of the university, whether it grants PhDs, the physical location of the material, and so on also get factored in. No wonder there are people whose full-time job revolves around just these issues.

From talking to the libraries I see that this is a concern for journal publications that are now distributed in electronic form. Distributors do not necessarily have a long-term plan, and they do not always allow libraries to make backups. What will happen if someone wants an article published electronically five years ago? Even if Google finds it in a search, will it still be there? Chuck's point about the short-term life of "data" is a really serious issue as applied to publications.

Meanwhile, has anyone yet taken a look at the new project ICPSR will participate in on certifying digital repositories? I am copying what Ann Green sent out a few weeks ago below. I went through the checklist for certification, and despite all that I do to ensure that what we hold that is unique at UCLA continues to be accessible, my facility doesn't even come close to meeting the criteria they specify. So what does this mean for the facilities that acquire data for research and instruction and act as holding sites for data collected by local faculty, or from agencies with no archival activities? There is no way I am ever going to get the resources to do what the RLG checklist specifies, and yet we have never lost anything in 25 years of archiving. Now that I am starting to actually read all those brochures about retirement, it occurs to me that if I go away it is not very likely that the materials now in the collection will have the same care or attention.

Are others facing this too? Should I just give it all to ICPSR? If we all decide to do that, then where can ICPSR get the resources to process everything fully, with complete DDI codebooks and so forth, so it can qualify as a digital repository? If the data producers who require user registration don't really have preservation procedures in place, then what should our role be? Should we just get copies of data anyway and ignore the registration? In the past it has helped that more than one facility had a copy of a file or a codebook, so is a centralized place like ICPSR really the best model? Some duplication might be a good thing. Or should there be archives specializing in just certain types of holdings, like the Roper Center and public opinion polls?

One thing that helps is a historical perspective. Some time ago there were no real standards for codebooks, and with effort on our part data producers got the message. They also mostly got the idea that providing accurate bibliographic details and requiring "funded" projects to be "archived" was a good thing. So maybe we now need to work on the preservation issue; perhaps this RLG proposal is one way to get it going.
RLG has just released a draft report for the certification of digital repositories. The draft, titled "An Audit Checklist for the Certification of Trusted Digital Repositories," is available at http://www.rlg.org/en/page.php?Page_ID=20769. It is the product of a task force working on a joint project between RLG and the National Archives and Records Administration (NARA). The goal of the RLG-NARA Digital Repository Certification project has been to identify the criteria repositories must meet for reliably storing, migrating, and providing access to digital collections. The "Audit Checklist" identifies procedures for certifying digital repositories.

Leveraging the RLG-NARA checklist, the Center for Research Libraries (CRL) Audit and Certification of Digital Archives project will test-audit the Koninklijke Bibliotheek (National Library of the Netherlands), which maintains the digital archive for Elsevier ScienceDirect journals; the Inter-university Consortium for Political and Social Research (ICPSR); and Portico, an archive for electronic journals incubated within Ithaka Harbors, Inc. Stanford's LOCKSS system will also participate in this effort. Robin Dale, manager of both projects, says: "We look forward to receiving comments on the draft and to hearing the response from the community." Comments on the draft are due before mid-January 2006 to Robin.Dale@rlg.org (+1-650-691-2238). For more about the RLG-NARA task force, see http://www.rlg.org/en/page.php?Page_ID=5441
Libbie Stephenson September 14, 2005

Hi, To add to both comments,

Hi. To add to both comments: even the most well-intended data provider may not have the wherewithal to protect data. We are currently dealing with an NGO that commissions several public opinion polls every year and would dearly love to have them made available to students and researchers. Unfortunately, they lack both the personnel and the expertise to do the job. We are now trying to piece together bits of methodological and other metadata to rescue some of the older polls, and to discern which version of the data is the correct one. We have arranged for them to deposit their new polls as soon as the data are received. I'm sure this is far from a unique example. Luckily there was a personal connection in this case that should result in some of the older data being successfully rescued. The problem is that serendipity isn't a practical means of data preservation. Wendy Watkins September 14, 2005

On Tue, 13 Sep 2005, Laine

On Tue, 13 Sep 2005, Laine Ruus wrote:

> how many of those data producers (whether they be commercial
> companies, research institutes, or term-funded research projects)
> who are 'marketing' directly to the end user only will still be
> around in 10 years, in 20 years, in 50 years? What plans do they
> have for archiving the data once they cease to exist?

I would like to continue Laine's argument by adding that a commercial data producer doesn't have to disappear in 10, 20 or 50 years for a data product to be lost. We had an excellent example of this on the IASSIST discussion list earlier this month. Tanvi sent the following message, enquiring about a product from a commercial data producer:

"I have a researcher who is trying to get hold of Dun & Bradstreet (ownership of US companies) data from the 1990s. We have contacted the data providers and they do not keep an archive of old data, so I'm hoping someone out there has an electronic copy buried on a server somewhere."
I am more frequently encountering an attitude that data have a shelf life, i.e., a best-before or expiry date after which they should be discarded. This attitude seems particularly established among commercial data producers, who are primarily interested in selling you the most recent data. I have also found it with a couple of editors of academic health journals. In working with some researchers on our campus, I have seen letters from editors, and in some instances peer reviewers, saying that the data used in an article were a bit old. "Couldn't newer data be used?" I can possibly see this having validity in descriptive research in which the prevalence of a current problem is being assessed. However, in a comparative or correlational analysis, where the relationships among variables are the focus, the time period in which the data were produced should be less of a concern. These editors seem to have confused "knowledge" with "news".

In addition to this attitude that data can become stale, there is the more overtly hostile position taken by some institutional review boards (IRBs) requiring the destruction of data after a specific date. This practice has roots in clinical trial research, where the motives for destroying the data likely involve concerns over and above human subject confidentiality. In Canada, and hopefully elsewhere, the IRB practice of requiring data to be destroyed is being challenged, and ethical guidelines are being reviewed to allow other ways of ensuring subject confidentiality without the destruction of data. The bottom line is that data should be safe from Father Time. Chuck Humphrey September 13, 2005

A wonderful discussion with

A wonderful discussion with lots of excellent insight. I would just like to reiterate some of the points made by Melanie; although not identical, we see many of the same issues in Canada. Without a doubt, some of these licensing arrangements are perceived as a cost by the researcher, and they can be a barrier. ICPSR registration is fairly simple and it opens the door to a whole collection of information, and it is correctly pointed out that this is not a significant constraint. However, that is not always the case: there are many different ways to register, potentially involving unique IDs and passwords, and once you get through that there may also be a unique extraction tool. That is fine if you are only shopping at one place, but that isn't always the case. As Libbie points out, this can also make things much more difficult for us in terms of supporting our users, and I think the fact that faculty redistribute files after the fact shows the cost outweighs the perceived benefit. In Canada, the one that bothers me is the licensing of geospatial data. Again, it varies greatly and we are making progress, but in many cases we still need researchers to sign off paper copies to use the data; we haven't yet convinced the powers that be that digital signatures or authentication should be good enough. I think the point expressed by Ilona (through Libbie) about needing to catch up on the policy front is a good one, but the need varies across providers and even across countries. Bo Wandschneider September 13, 2005

There is another concern

There is another concern here, to me at least, and that is the question of long-term preservation of the data. ICPSR, the Essex archive, etc. have a proven track record in long-term data preservation, but how many of those data producers (whether they be commercial companies, research institutes, or term-funded research projects) who are 'marketing' directly to the end user will still be around in 10 years, in 20 years, in 50 years? What plans do they have for archiving the data once they cease to exist? My guess is, none. Unfortunately, with the current climate of ethics review requirements in Canada, a number of unique data resources will probably no longer exist in 10 or 20 years, because they have no 'succession plan' and insist on dealing only with the researcher. Denise Lievesley, a number of years ago at IASSIST, enumerated a list of about 10 services data archives provide by virtue of their very existence. One of them was long-term preservation of data produced by organizations with a short shelf-life. This cannot happen if the data are provided only to the individual researcher. Laine Ruus September 13, 2005

Under a principle/policy that

Even under a principle/policy that 'data is free at the point of use', there remains the possibility that some sources of data will involve payment, usually an institutional subscription/agreement. Even if the subscription is free, i.e. requires no payment, it will usually involve agreement to some terms and conditions, including that access is limited to members of that institution. And if access is to be easy (making use of the web and Internet), some form of authentication and authorisation arrangement is required. Authentication is the way of ensuring that the (potential) user is indeed a member of the authorised institution. The system being explored internationally to support this is Shibboleth, which came out of Internet2. A number of organisations, including EDINA, MIMAS and the UKDA here in the UK, are working together to convert their own systems to make use of Shibboleth. Additional information, including some very techie info, can be had from http://shibboleth.internet2.edu/ and http://edina.ac.uk/projects/#infrastructure

Shibboleth also featured at the IASSIST/IFDO Conference at Edinburgh this year:

E3: Enlightening Access Control: New Methods
Issues in federated identity management
Sandy Shaw (EDINA, University of Edinburgh)

Shibbolising UK Census and ESDS services
Lucy Bell (UK Data Archive, University of Essex)
The presentations are available for download from http://www.iassistdata.org/conferences/2005/presentations/ One of the key points to note is that Shibboleth is based on individuals belonging to institutions. This differs from the individual/retail model implied by some MyAccount schemes. A completely separate issue is whether these authenticated and authorised users have to declare information about themselves or have their use tracked. Peter Burnhill September 13, 2005
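To make the division of labour Peter describes concrete: the Shibboleth Service Provider does the authenticating against the user's home institution and simply hands attributes to the application behind it. Here is a rough sketch of that application side in Python, as a WSGI app; REMOTE_USER and the 'affiliation' attribute are typical of such deployments, but the exact names vary by federation, so treat them as assumptions:

```python
from wsgiref.simple_server import make_server

def application(environ, start_response):
    """WSGI app that trusts a Shibboleth Service Provider in front of it.

    A web server with the Shibboleth SP module authenticates the user
    against their home institution and injects attributes into the
    request environment; the application only inspects them. The
    attribute names here are typical but deployment-specific.
    """
    user = environ.get("REMOTE_USER")             # e.g. user@ed.ac.uk
    affiliation = environ.get("affiliation", "")  # e.g. "member@ed.ac.uk"

    if user and "member" in affiliation:
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [f"Welcome {user}: access granted.\n".encode("utf-8")]

    start_response("403 Forbidden", [("Content-Type", "text/plain")])
    return [b"No Shibboleth session, or not an institutional member.\n"]

if __name__ == "__main__":
    # Run standalone for illustration: with no SP in front, no attributes
    # are injected, so every request is refused -- which is the point.
    make_server("localhost", 8000, application).serve_forever()
```

The design choice this illustrates is the one noted above: the home institution vouches for membership, so the data service itself never needs to collect or store personal details of its own.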

Speaking as an 'outsider' on

Speaking as an 'outsider' on this subject: what Mary says makes perfect sense to me. As a user, I never mind providing information at this level. What I don't like is being expected to pay for data! Patty Becker September 12, 2005

To add to the interesting

To add to the interesting discussion that Libbie initiated regarding user registration requirements: at ICPSR, we require that anyone downloading data from us register with the MyData service. Completing the one-time registration form takes at most five minutes. Subsequent sessions require that users supply only a userid and password, similar to the Yahoo and Amazon models of authentication. This type of registration serves multiple purposes for ICPSR:

(1) It helps us track data usage, enabling us to provide evidence to funders that their investment in ICPSR is worthwhile. ICPSR receives funding from many sources beyond the membership, each of which requires usage statistics of some form. Complying with these funders' requests permits us to continue to build the archive to serve the needs of our community.

(2) It helps us determine that the user really is who he or she claims to be, thus fulfilling our responsibility to data depositors.

(3) It helps us design other services to benefit users, e.g., order history tracking, updates when collections previously downloaded are revised, notification that data of interest have been released, etc.

(4) It improves our ability to fulfill requests from member institutions about who on their campus is using ICPSR resources, and to communicate directly with data users when that is appropriate (users have the option to opt out of being contacted, however).

From our perspective, at this point the benefits of registration far outweigh the user burden, which seems fairly light. It's always good to hear about these types of concerns, though. Perhaps, as Libbie mentioned, we can continue this discussion through a session at the upcoming IASSIST conference in Ann Arbor. Mary Vardigan September 12, 2005

Hey Melanie, My thoughts

Hey Melanie, my thoughts exactly. Bravo. On the other hand, the cost of setting up a registration system and seeing to the administration behind it is a cost absorbed by the archive on behalf of the researcher. So while the user may not "pay" anything, there are costs involved. I am in a relatively small unit, and normally I rely on the larger archives such as the UKDA and ICPSR to develop the technology and the administration; my thought is that my ICPSR membership fees include some of this kind of support. When a faculty member wants data from an archive outside of that umbrella, I find myself caught between wanting to make data access as smooth and easy as possible and needing to meet the requirements of the data distributor.

There is another wrinkle here too. The faculty are asking: why does each and every user have to register and download their own copy? And again, I second your comments about protection for the respondents. I have found that I can make arrangements with archives to download one copy and make it available to each individual, but only if they register first with the distributing archive. Not all data distributors will accept this plan, because I do not have the infrastructure they want for monitoring the use of the data. In those situations we refer the user to a secure data center service. But again, this means more bureaucracy for the user and administrative costs locally. When that happens, I go to the larger organizations; they are in a better position to negotiate these kinds of arrangements.

I was talking to Ilona Einowski about this, and I think it is OK to paraphrase her thoughts here. It seems to be the case that technology has moved along at a pace faster than our work on setting policies and procedures. We need to be thinking about these issues broadly, as well as in terms of how to implement data access locally. We might draw on the experiences of libraries and their work to negotiate licenses for digital materials. It might also be helpful to explore the policies academic institutions have for file sharing. Is there an IASSIST blog on this topic? Would this be a good topic to propose for the IASSIST meeting next May? Again, I would be interested in any responses people have, and thanks to everyone who has so far shared their thoughts. Libbie Stephenson September 9, 2005

I think our users are more

I think our users are more bothered by the IRB application process than by the fact that some sites require registration before data access is granted. Data access registration should be a one-time form that allows simple access from that point forward. ICPSR does a good job of this, as do most of the commercial vendors I use for transactions (airlines, etc.). The University of Michigan does not consider public-use data exempt from the IRB review process. Thus, for any research, the user has to fill out a pretty extensive form to get an exemption. Luckily, the form can be amended once one decides to use an additional file to study xxxxxxxxxxxxxx. The skip pattern on the form is such that the question about 'secondary data' comes late in the process. Thus, the researcher first has to answer questions such as:

  • Can the respondent quit the survey?
  • Are there pregnant women in the survey?
  • Etc., etc.

Back to Libbie's comment: small libraries without an abundance of technical staff will find it difficult to meet the requirements of data producers and yet satisfy their customers. Add Health allows us to redistribute the public-use version of the data within our user base. However, we have to keep a list of users, and to make that easy for the user, you need to do it electronically. Not all small operations will be able to pull this off. Lisa Neidert September 9, 2005

Well said! Keith Cole

Well said! Keith Cole September 9, 2005

Well, I think the situation

Well, I think the situation in Europe is a little different, because there is much less of an overarching ethic that data collected at public expense should be freely available. Registration is necessary for our users because our depositors (the data creators) want to have legal contracts with users wherein users promise not to misuse data, not to identify individuals, not to pass data on, not to use them for commercial purposes without explicit permission, and so forth. The government departments simply would not (and, under the legal consent framework for survey respondents, could not) release these data without these legal safeguards. Our response is to make the registration process as simple and online as possible, and to bring as many resources as we can under a single registration system, so it only has to be done once for all national social science resources. Personally, I think it is a good tradeoff, which for a little effort enables academic access to data that wouldn't otherwise be available. With current computing power and data-linkage techniques, foolproof anonymisation is, for most surveys of any depth or complexity, an impossibility. If user registration is what it takes to keep these data available for research, it seems a small price to pay. Just my 2p. Melanie Wright September 9, 2005

Glad you brought this up

Glad you brought this up, Libbie; I've been thinking about this too. The other issue I'd add is user privacy (another kind of cost), about which libraries (my setting) are normally very careful. Virtually none of the other databases to which we subscribe requires user registration or tracking of individual users' activities, and we are very reluctant to let vendors track individuals' information use. However, one archive of which we are a member has instituted a user registration system; when I expressed concern about it, they said it was for user convenience, to enable users to track activities across sessions. My response was that this should be an optional feature that users can select (making their own decision to sacrifice privacy for convenience) rather than a requirement. In response to your last question, though, I wouldn't know how to quantify this issue (and I have not heard the complaint from faculty you mention, but that doesn't mean they're not thinking it). Hopefully the data archive reps on this list are thinking about these issues and taking our feedback! Kate McNeill-Harman September 9, 2005
