Technology such as SPSS and SAS used for retrieving and manipulating data has remained virtually unchanged since the first statistical packages were introduced almost four decades ago. In recent years, however, the scale and complexity of population and health data have expanded rapidly, and reliance on antiquated data management tools imposes heavy costs on the research community. This presentation will discuss new approaches to managing complex data and describe new technologies that will allow researchers to integrate and restructure complex data without custom programming.
2008-05-28: A1: DDI And Related Tools: Next Generation Tools For Converting, Displaying And Visualizing Data
2008-05-28: A2: Describing Data and Data Use
Best Practices Documents - Are They Really Necessary?
Michelle Edwards (University of Guelph)
Jane Fry (Carleton University)
Alexandra Cooper (Queen's University)
In Ontario, there is a movement afoot to mark up surveys in DDI and put them in an interface that allows them to be shared with other universities. A noble exercise, indeed! Our project, the Ontario Data Documentation, Extraction Service and Infrastructure Initiative, provides university researchers with unprecedented access to a significant number of datasets in a web-based data extraction system. Access to the data with its accompanying standardized metadata is key to our project. However, the staff marking up these surveys do not necessarily think alike, so the formats used in marking up the surveys can and do vary across institutions. And this is taking place in only one province, which raises the question of what the formatting will look like when the markup is done nationally. In this presentation, we will discuss the five Ws of a Best Practices Document: why we need one; when it happened; where it was put together; what the process was; and who will benefit from it.
Assessing the Scientific Benefits of Interdisciplinary Use of Social Science Data through Citation Analysis
Robert S. Chen (CIESIN)
Joe Schumacher (CIESIN)
Bob Downs (CIESIN)
Chris Lenhardt (CIESIN)
Those operating data center activities are often called upon to justify the value of their work in terms of the scientific impact of the data they manage. Although anecdotal evidence of the use and importance of such data is often available, providing quantitative measures of such benefits is difficult. The increased availability of full-text search tools for the scientific literature opens up the possibility of more systematic citation analysis to characterize data use by the broader scientific community. We report here on our exploratory efforts to compare alternative search strategies for selected socioeconomic datasets and examine possible citation metrics that could form the basis for assessing usage and impacts over time. These efforts also suggest some possible avenues for encouraging scientists to improve their citation of data, for improving literature databases and search tools, and for developing unique identifiers for datasets. The ability to track and quantify citations to digital content is also likely to be important for other uses such as career advancement and assessment of data quality.
At the IASSIST 2006 conference, we showed how a datum is a designation in the theory of terminology for special languages (The Nature of Data). Now we show that the description of a datum is terminological, also. Even though metadata are designations in the terminological sense, this is not our focus. We mean that the metadata describing a datum follow the basic terminology framework consisting of concepts, their characteristics, and properties associated with those characteristics. There are three interrelated components of the description of a datum: representation, datatype, and semantics. Therefore, the talk will focus on the basic terminological framework, the components of the description of a datum, how those components fit together, and how the components fit into the terminological framework. Finally, since the conference is about social science data, all the examples will come from statistical surveys. For example, we will show how the kinds of statistical data (nominal, ordinal, interval, and ratio) are really constituents of datatypes. In addition, units of measure, often associated with quantitative data, are seen as the bridge between datatypes and semantics. A full descriptive framework for a datum will be the result.
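As a toy illustration of the three components named above, a single datum might be described roughly as follows (a minimal sketch; the class and field names are invented for illustration and are not part of the talk):

```python
from dataclasses import dataclass

@dataclass
class DatumDescription:
    representation: str   # how the value is written down, e.g. "42" or "042"
    datatype: str         # e.g. nominal, ordinal, interval, or ratio
    unit_of_measure: str  # the bridge between datatype and semantics
    semantics: str        # the concept the value is about

age = DatumDescription(
    representation="42",
    datatype="ratio",
    unit_of_measure="years",
    semantics="age of respondent at last birthday",
)
print(age)
```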
Persistent Identifiers: Cornerstone in a Web Oriented Scientific Environment
Maarten Hoogerwerf (DANS)
SurfShare is a national programme in the Netherlands which strives to create better access to high-grade (scientific) knowledge, at lower costs than is currently the case. This is feasible, as ICT not only accelerates the traditional communication processes, but also changes the nature of the knowledge chain. Traditional publications, instruments (models, algorithms, visualisations) and research data are increasingly interwoven due to the increased possibilities of knowledge sharing and dissemination. All the major scientific organisations in the Netherlands are cooperating to establish a joint infrastructure that advances the accessibility as well as the exchange of scientific information. In a web oriented infrastructure it is crucial that scientific publications and research data can be identified in a uniform and persistent way. Persistent Identifiers form an important component in the Netherlands' joint infrastructure. This paper will address the organizational challenges and technical choices which are being made in the Netherlands and will focus on best practices and pragmatic solutions.
Using Pictures to Tell a Story: Mapping Economic Data for Researchers and the Public
Katrina Stierholz (St. Louis Fed)
The Federal Reserve Bank of St. Louis has historically been a source of economic data. The FRED database has over 15,000 economic series available. In order to improve the accessibility of our FRED data, the St. Louis Fed developed a data mapping tool called GeoFRED™. The speakers will discuss the underlying data and the configuration of the GIS tool (created using open source software). They will also describe the intended audiences and goals for the project, issues that surround mapping data, and the ways that the St. Louis Fed has used this tool to reach new audiences. Curriculum has been developed to provide teachers with lesson plans that incorporate GeoFRED and meet state educational goals for economics, geography, and social studies. The speakers will also demonstrate the GeoFRED™ website, its features and the underlying data.
Using the Web to Communicate Survey Metadata: Design, Development and Maintenance of the ESRC Question Bank
Julie Gibbs (University of Surrey)
This session is intended to be a follow-on from a session at the IASSIST conference in 2006 on web development and websites that were good examples for students wishing to find data. This paper will provide members with an update on recent developments! The ESRC Question Bank was redesigned in 2006 with a new interface developed around current web standards and with user feedback in mind. This has been a great success, with user statistics showing a 70% increase in visits to the site! In this presentation I will discuss how the web interface was developed and why we have kept the rather flat structure of the Qb as opposed to developing a database for the survey questionnaires that we hold. I will also discuss the maintenance required for this type of resource, and how it can be used to get students to think about using or recycling survey questions for their own research work. The session will aim therefore to update the audience on current web designs for metadata dissemination with particular reference to the Qb whilst pondering whether we are making web dissemination too complex for students to use.
Data websites at small schools aim to serve a variety of constituencies: undergraduates and graduate students from a range of majors with a range of data skills, faculty of varying technical ability, and librarians without training in data who nonetheless must answer data questions at the reference desk. I report on what happened when a small school (Trinity College) redesigned its website to cater to these varied constituencies by being more inclusive of the varying formats in which library resources now come: not just journal databases, but audio/video resources, images, and, of course, data. With data collected from the new website on what resources get the most “hits” and how users exploit multiple entry points to access those resources, I highlight which resources are the most surprising in terms of the number of hits they now receive. Finally, I pose some theoretical questions concerning how we should think about data websites in the more general context of library websites.
Life Cycle of and Open Access to Research Data in Finland
Tuomas J. Alatera (Finnish Social Science Data Archive)
The OECD recently published its guidelines for access to research data from public funding. Motivated by the OECD guidelines, the Finnish Social Science Data Archive conducted an Internet survey on the preservation and reuse of research data, targeting the humanities, social sciences and behavioural sciences. The aim of this survey was to chart how the universities in Finland have organised the depositing of digital research data and to what extent the data are reused by the scientific community after the original research has been completed. Views were also investigated on whether confidentiality or research ethics issues, or problems related to copyright or information technology, formed barriers to data reuse.
Barriers to Data Archiving and Sharing in Health Research – Lessons from a User Study
Lone Bredahl (Danish Data Archive)
In some research domains data archiving constitutes an integral part of the research process. Here, data are deposited not only for personal reasons, and data are willingly shared with fellow researchers and other interested parties. In Danish health research, central archiving of research data broke new ground in 2005 with the establishment of a separate entity for health data, DDA Health, at Danish Data Archives (DDA). To obtain a systematic overview of expectations and experiences of central archiving, as well as of behaviours and attitudes with regard to data sharing, an empirical study was carried out in summer 2007 among depositors and potential users of data services in DDA Health. Data were collected by a combination of qualitative and quantitative methods. Initially, a focus group and five personal interviews were carried out. Following these, a web-based survey was conducted based on an extract of email addresses from the administrative database at DDA. Results point to low perceived necessity and lack of consideration of the issue of data preservation overall as major barriers to data archiving. Data sharing, on the other hand, is clearly (negatively) linked to perceptions of data ownership. At the same time there is no tradition for data sharing through more formal channels, such as a central data archive.
Asian Social Science Data Accessibility
Daniel C. Tsang (UC Irvine)
This paper takes an overview of Asian social science data availability and accessibility in Asia and, to a certain extent, abroad. It covers not only what is available in existing social science data archives, but also what is available from government agencies, survey research organizations and individual researchers. It looks at different cultures of data sharing across the region and what contributes to that. It addresses whether emerging standards (such as DDI) are applied in Asia and what needs to be done to close the gap between social science data archiving in the West and in Asia.
2008-05-28: B2: Tools for Data Visualization and Manipulation
Web 2.0 Data Visualisation Tools
Stuart McDonald (University of Edinburgh)
As Web 2.0 continues to evolve and transform into what is being referred to as Web 3.0, we are seeing the boundaries between websites and web services blurring as more and more web content becomes remixable. Many of the resultant visualizations and applications can be achieved with no more than a basic understanding of the underlying technologies. This presentation will discuss the range of collaborative web utilities that use Web 2.0 technologies and venture into the numeric and spatial data visualisation arenas. There is a whole range of map (or spatial) mash-ups that utilize Web 2.0 technologies and interactive mapping products such as Google Earth and MS Virtual Earth. Such mapping utilities have paved the way for research organisations to explore and expose their findings in new and innovative ways. There has, however, been less publicity regarding the visualisation of data, once thought to be the remit of domain experts. This presentation will also look at and compare a number of utilities, such as Swivel and Many Eyes, that to varying degrees visualise data and allow data users the opportunity to interact with and share data in an open environment.
An unintended consequence of using social sciences data, especially from government sources, is that users end up devoting a significant amount of time to dealing with the mechanics of data cleanup that could (in theory) be otherwise spent thinking about the content and meaning of the data instead. Often that time is further devoted to learning a great deal about a single piece of proprietary software that may or may not continue to be available over time. Swivel appears to be a tool that resolves this problem by simplifying data visualization. It uses non-proprietary data formats as its input and automatically turns out a range of graphical displays for any table uploaded. But does this really help with the issue of misplaced effort by users? Even if it does, how might it be integrated into a data services program in an academic institution? I will be discussing my efforts to develop a more user-friendly version of the Consumer Expenditure Survey data via Swivel and where/how I think Swivel and similar tools might fit into a data services program.
The open source programming language Python is often recommended as a first language for those new to programming. Some have even argued that for those who program infrequently, it may be the only language they need to learn. For data services the Python language has a number of compelling features. It is easy to learn, has clean syntax, and features an extensive collection of modules to help address the sorts of “data munging” and administrative tasks that users often find themselves engaged in. Examples include automation of repetitive data manipulation processes, gluing together multiple applications in order to accomplish a complex task, extraction of data from websites and web services, and scripting for websites and servers. This presentation will give an overview of the features of the language of particular interest to data users. Comparisons will be made to other popular scripting languages. Modules relevant to data manipulation will be discussed. Finally, attention will be given to the recent integration of Python into SPSS and ArcGIS and how this might be relevant to data users.
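A minimal sketch of the kind of repetitive "data munging" the abstract refers to, using only the standard library; the directory, file, and column names are hypothetical placeholders, not taken from any real dataset:

```python
import csv
import glob

# Apply the same recode to every yearly extract in a directory, writing a
# cleaned copy of each file alongside the original.
for path in glob.glob("extracts/wave_*.csv"):
    out_path = path.replace(".csv", "_recoded.csv")
    with open(path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["income_band"])
        writer.writeheader()
        for row in reader:
            income = float(row["income"])
            # Collapse a continuous income value into coarse bands.
            row["income_band"] = ("low" if income < 20000
                                  else "mid" if income < 60000 else "high")
            writer.writerow(row)
```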
2008-05-28: B3: What Is Old Is New Again
Canada Year Book Historical Collection
Bernie Gloyn (Statistics Canada)
First published in 1867 as the Year Book of British North America, the Canada Year Book became the premier reference resource on the social and economic life of Canada and its citizens, a function it still performs today. With funding from Heritage Canada, Statistics Canada has digitized the first hundred years of this important historical resource and is making it freely available on the Internet. An important step by the agency, it has allowed us to develop our expertise in making key historical resources more readily accessible, searchable and useable. The digitized year books are supplemented with tables, graphs and maps and are linked to a series of lesson plans to spark their use by teachers and students. Spanning more than 3 years to completion and exceeding a terabyte of data, the project has encountered and overcome a number of issues of interest to the IASSIST community. This presentation will give a brief history of the project, demonstrate the website, and highlight the issues overcome and future directions possible.
Moving an Archive from Tape to Disk: A Case-Study at ICPSR
Bryan Beecher (ICPSR)
In early 2007 ICPSR's digital archives consisted of 700 magnetic tapes stored in two separate locations in Ann Arbor, Michigan. The best-case scenario for retrieving content consisted of a data librarian finding the correct tape, mounting that tape, restoring the desired content, and then copying that content to a well-known location for the data manager. A worst-case scenario could include trips to off-site storage locations, composing content from a combination of Master and Backup tapes, and multiple attempts at finding just the right content. By late 2007 the archives had been copied to spinning disk and replicated across storage grids. In true belt-and-suspenders fashion, an additional copy also resides on tape, and this copy of the entire archive fits on a mere six high-density tapes. The online content may be searched and browsed, and with sufficient access rights, an ICPSR data manager may fetch any file through a convenient web interface. This presentation describes the starting point of the migration, challenges faced and lessons learned during the process, and the state of the archives post-migration. We reference technologies that we found useful during this process, but do not probe too deeply into their intricacies.
Solving Study Metadata Puzzles: Case Studies from Roper Center Reprocessing Activities
Marc Maynard (Roper Center)
During the past several years the Roper Center has been reprocessing older data collections in order to make them available directly to researchers via the RoperExpress download service. In a technical sense, reprocessing of data files (particularly conversions from IBM column binary formats to ASCII) seems fairly straightforward, but in order to deliver more enhanced data resources, more fully developed metadata is desirable. Discovery and integration of methodological and variable-level metadata for studies that are over 30 years old can be challenging, however. This paper will focus on several reprocessing case studies encountered over the past year and, from their unique scenarios, try to develop a framework to guide this unique brand of detective work.
2008-05-28: C1: In Data We Trust: Maintaining Confidentiality, Authenticity and Quality
Access to Labor Force Data in Germany
Dana Mueller (Research Data Center of the German Federal Employment Agency at the Institute for Employment Research)
Access to administrative and confidential microdata has developed remarkably fast over the last few years in Germany. One crucial factor in improving the statistical infrastructure was the establishment of research data centers. The focus of this presentation is to introduce the research data center (RDC) in Nuremberg that facilitates access to the confidential administrative and survey data of the Federal Employment Agency and the Institute for Employment Research. We provide cross-sectional and longitudinal data on individuals and firms. Data on individuals include comprehensive information on employment, unemployment and job search; our firm survey covers a wide range of aspects and is also available as linked employer-employee data. We offer different access methods which depend on the degree of data confidentiality: anonymized scientific use files are sent to the researcher, while more confidential data can be analyzed via on-site use or remote execution. The access is free of charge and not restricted to German researchers. Researchers from abroad may get a grant to visit the RDC.
Continuing the GPO Trust Relationship through Authentication
Robin Haun-Mohamed (US Government Printing Office)
Gil Baldwin (US Government Printing Office)
Virginia Wiese (US Government Printing Office)
The U.S. Government Printing Office (GPO) has kept America informed by producing and distributing Federal Government information products for more than 140 years. With the rise of digital access to content, the Federal Depository Library Program (FDLP) has a mandate to ensure permanent public access to U.S. government published information. While GPO is involved in all aspects of data technology, including collection, communication, access and preservation, this presentation will focus on GPO’s state of the art authentication initiatives. Recognizing that confidentiality, data integrity, and non-repudiation are critical, GPO’s primary objective is to assure users that the information made available electronically is official and/or authentic and that trust relationships exist between all participants in electronic transactions. Through the application of Public Key Infrastructure (PKI) technology, electronic documents will bear a digital signature of GPO’s authentication logo, identifying different levels of authentication and validating the document has not been altered. GPO’s Federal Digital System (FDsys) is the advanced digital system and critical technology that will enable stewardship throughout the content lifecycle. The first live release of the system in late 2008 will establish the foundation for an OAIS preservation archive replacing GPO Access, the agency’s existing public access Web site.
2008-05-28: C2: Facilitating Data Access: Developing Multi-Function Access to Data Collections
ODESI: Creation of a Web-based Data Exploration Portal
Paula Hurtubise (Carleton University)
Through standards, design and technology, ODESI, the Ontario Data Documentation, Extraction Service and Infrastructure project, has created a web portal for university researchers, academics and students that renders them discriminating and informed users of a vast collection of social research data. This sophisticated data portal, with supporting DDI-compliant metadata, houses microdata, such as Gallup Polls and Statistics Canada census files, which can be searched, browsed, analysed and downloaded. It eliminates the steep learning curve associated with the use of microdata files. The ODESI exploration tool facilitates investigation and creative data intervention, making even the novice an autonomous and innovative researcher. This paper provides an overview of the ODESI project from its inception three years ago to implementation today. Acquisition of budget, informatics architecture, communications strategies and the development of key partnerships will all be discussed. ODESI inspires, develops and supports research excellence in the academic environment.
Providing Access to Born Digital Archival Data in an Era of Search Engines
Margaret O. Adams (U.S. National Archives and Records Administration)
Five years ago, the U.S. National Archives and Records Administration launched an online search and retrieval tool for access to a selection of its “born digital” data records (the Access to Archival Databases (AAD), www.archives.gov/aad). In the intervening years we have observed a variety of patterns in the use of this resource and in the types of records queried most frequently. Those patterns have, in turn, influenced the new series added to the resource. The volume of users and the queries they run, now averaging over five thousand daily, substantially exceed all measures of interpersonal reference demand in the National Archives’ custodial electronic records program. The growth and ubiquity of search engines during the same time period have influenced enhancements of the resource. This presentation will offer an overview of these experiences and focus on the ways in which the online resource has had an impact on interpersonal reference services, and has resulted in an expansion of the population using archival digital records in the holdings of the National Archives. While some of the lessons learned are most relevant for other public archives, the overall impact of and rising expectations for the online environment in the public sphere have implications for the academic research environment as well.
Welcome to the SodaPop Shop - Data Fast and Fizzy and in Many Flavors
Kiet Bang (PRI, Penn State)
The Population Research Institute developed the Simple Online Data Archive for Population Studies (SodaPop) in 1995 to provide our researchers with a web-accessible data archive of primary and secondary data files. In this paper, we will share our experience with the development of SodaPop as a value-added data discovery and analysis tool. SodaPop provides descriptive dataset information including links to abstracts and primary sources; facilitates variable searching within the content of individual datasets and across the collection; and allows users to create an "on the fly" extraction of a customized, analysis-ready data subset in all major statistical package formats (SAS, Stata, or SPSS dataset and comma or tab delimited format). SodaPop has also transitioned from an in-house tool to a campus and public resource.
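The core of such "on the fly" extraction can be pictured with a small sketch that pulls a user-selected set of variables out of a stored file and writes an analysis-ready delimited subset (illustrative only, with hypothetical file and variable names; this is not SodaPop's actual implementation):

```python
import csv

def extract_subset(dataset_path, variables, out_path, delimiter="\t"):
    """Write a delimited subset containing only the requested variables."""
    with open(dataset_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(variables)
        for row in reader:
            writer.writerow([row[v] for v in variables])

# Hypothetical usage: a user picks three variables from a stored survey file.
extract_subset("survey1990.csv", ["caseid", "educ", "bmi"], "subset.tsv")
```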
2008-05-28: C4: International Outreach: Open Discussion
2008-05-29: Plenary II
2008-05-29: D1: Metadata: Enhancing Access to Data Resources
Data and Knowledge Management at the Federal Reserve Board
Andy Boettcher (Federal Reserve Board)
The Federal Reserve Board purchases and creates numerous datasets to support its role in monetary policy, banking regulation, and consumer protection. To better manage these datasets, the Board has built a metadata repository called the Data and News CataloguE (DANCE) to store descriptive dataset characteristics. The growing number of datasets and their corresponding security and licensing issues motivated a data initiative, in which the Board's research community identified enhancements that could be made to DANCE. Planned improvements include the addition of Dublin Core standard metadata, the communication of changes in metadata, and the dissemination of metadata on new datasets. The improvements are expected to foster more collaboration, which will enable better research. This paper will chronicle DANCE's original role within the organization and its transformation into a knowledge management solution.
Making Sense of the Census: Creating a Census Aggregate Information Resource Demonstrator
Justin Hayes (Mimas, University of Manchester)
UK censuses are fundamental tools for social scientists interested in conditions within the UK. However, use of aggregate outputs from the UK and other censuses has always been severely limited by the separation and fragmentation of data and metadata, with meaning effectively encrypted in graphical table layouts, or buried within ‘supporting documentation’. This has precluded the development of anything but rudimentary search and exploration applications, and frustrated attempts to create machine-readable services to make the aggregate information more widely available and accessible. This paper describes a project funded by the UK Economic and Social Research Council to create a Census Aggregate Information Resource Demonstrator (CAIRD), which will combine data and metadata from the UK 2001 Census through the use of developments in open-standards structured XML, and XML-aware database systems to create a machine-readable, application-ready online service. Some of the many potential benefits offered by CAIRD include the facilitation of advanced search and exploration applications; flexible generation of aggregate information to user request; efficiency in storage, maintenance, update and operation; and encouragement of secondary innovation and the development of online user communities. The main aim of the CAIRD project is to demonstrate the feasibility of this approach in order to encourage adoption of similar methods by the national census agencies for aggregate outputs from the UK 2011 Census.
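A simplified sketch of the general idea of carrying an aggregate value together with the metadata that gives it meaning, rather than leaving that meaning implicit in a table layout; the element names and codes below are invented for illustration and are not the CAIRD schema:

```python
import xml.etree.ElementTree as ET

# Build one aggregate observation with its describing metadata attached,
# so that the figure is machine-readable without the original table layout.
obs = ET.Element("observation")
ET.SubElement(obs, "area", {"code": "E05000001", "level": "ward"})
ET.SubElement(obs, "topic", {"label": "Persons aged 16-74 in employment"})
ET.SubElement(obs, "value", {"unit": "persons"}).text = "1432"

print(ET.tostring(obs, encoding="unicode"))
```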
Searching for Datasets and Variables with SDA
Tom Piazza (UC Berkeley)
Charlie Thomas (UC Berkeley)
The SDA team is developing new methods that will allow users to search an SDA archive for datasets of interest and also to search for variables both within a single dataset and across multiple datasets. Users can search for words or phrases in basic core fields in the study-level descriptions or in the variable-level metadata. We will show how searching looks to the student or researcher. We will also give an overview of how the archivist would set up these capabilities for an SDA archive of datasets.
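The kind of variable-level search described can be pictured with a toy sketch that matches a keyword against variable names and labels within and across datasets (the metadata below is invented for illustration and is not SDA's actual index):

```python
# Toy variable-level metadata for two hypothetical studies.
CATALOG = {
    "gss_subset": {"age": "Age of respondent", "happy": "General happiness"},
    "anes_subset": {"age_grp": "Age group", "vote": "Voted in last election"},
}

def search_variables(keyword):
    """Return (dataset, variable, label) tuples whose name or label matches."""
    keyword = keyword.lower()
    return [(ds, var, label)
            for ds, variables in CATALOG.items()
            for var, label in variables.items()
            if keyword in var.lower() or keyword in label.lower()]

print(search_variables("age"))
# [('gss_subset', 'age', 'Age of respondent'), ('anes_subset', 'age_grp', 'Age group')]
```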
Variables, Datasets, and Finding What You Want
Dr. Anthony Rafferty (University of Manchester)
Sam Smith (University of Manchester)
A common problem when searching repositories for secondary data is finding useful data to meet specific requirements. Variables are a fundamental building block of data analysis and usage. This talk covers the benefits to users from a search system that generates information and cross-references for variables in each file in the 250+ large-scale UK Government datasets supported by ESDS Government. Use of broad but highly targeted search, along with the integration of a variety of sources of data, documentation, and metadata, yields a powerful search platform. While the examples will be from our service, the talk will include suitable references to other international systems.
2008-05-29: D3: Numeracy, Quantitative Reasoning and Teaching about Data
Innovation in the Use of Data for Teaching and Research : The Russian Case
Anna Bogomolova (Moscow State University)
Tatyana Yudina (Moscow State University)
Capacity building and statistical literacy are in demand in Russia for effective public administration. Universities are at the leading edge: the challenge is to work out educational courses and training modules to teach the analysis of state statistics. The Moscow State University Research Computing Center and Faculty of Economics have built an information system that integrates state statistics, covering socio-economic and budget data available at federal, regional and local levels. Each indicator may be monitored across ten years of coverage and analyzed using applied mathematical methods and developed models. Data from other state agencies will be added in 2008. A special feature is the development of detailed methodology for indicators as an element of academic service. The database is used for research and for innovative education programmes on the regions of the Russian Federation. In recent months interest in the database has been growing among government agencies, as it supports systematic analysis for decision making at the federal, regional and local levels.
Numeracy at the University of Guelph: One Year Later - Where Are We Now?
Michelle Edwards (University of Guelph)
In 2006-2007, the Data Resource Centre was the lead partner in a multidisciplinary project entitled “Numeracy and Quantitative Reasoning Initiative” at the University of Guelph. The goal of the project was to build new opportunities to improve numeracy and quantitative reasoning skills, and to help students overcome their insecurities around dealing with numbers. The repository was built and populated during the summer and now houses several learning objects covering topics from numbers to graphing data. This paper will showcase the Numeracy repository and the format of the learning objects contained within. The paper will also discuss how the project took shape and the different ways it is being used by the University of Guelph faculty.
Teaching, Testing, and Assessment in a Quantitative Reasoning Course: Taking Aim at a Missing/Moving Target
Lisa Neidert (Population Studies Center, University of Michigan)
This paper explores some challenges in teaching a class that satisfies the quantitative reasoning requirement at a large research university. The presentation has concrete examples from the classroom setting. These illustrative examples are to provoke discussion, rather than to be additional entries into a statistical literacy toolbox. The main issues to be covered are (a) determining the appropriate mix for class content (subject matter vs quantitative exercises); (b) testing the students; and (c) assessing the course – what did the students learn? The University of Michigan has had a quantitative reasoning requirement for students in the College of Literature, Science, and the Arts (LSA) for almost 15 years, but there is no LSA-wide oversight of these courses – thus the “missing target” in the title. The “moving target” in the title describes my changing perspectives on what a quantitative reasoning course should be and a way to deliver this product. The presentation will describe some of the changes in the course focus and end with issues of student testing and course assessment.
This session will describe two iterations of an effort to create a quantitative research module for a Masters in Nursing research methods course at the University of Windsor. The first version involved a single three-hour class incorporating both a lecture and a hands-on practice session, followed by an assignment to independently locate and analyze a dataset, with extensive support from the library. The second version was both more extensive and more structured, with a three-hour lecture, an assigned reading, a three-hour practice session and an analysis assignment using a pre-selected data set. This session will discuss what worked and what didn’t work and will include an analysis of feedback from an anonymous questionnaire filled out by the students following the second unit.
2008-05-29: E1: Data Security and Access: Connecting from Afar
The Development of Remote Access Systems
Tanvi Desai (Research Laboratory, London School of Economics)
The paper will outline the history of the development of remote access systems, in particular for access to microdata. I will then look at the types of remote access solution in use today by various data providers internationally and assess the strengths and weaknesses in each. Strengths and weaknesses will be judged primarily in terms of ease of use, data quality, data accessibility, data security, and support burden on the data provider.
(Meta)Data and Remote Computing at IdZA: Experiences from IZA
Nikos Askitas (IZA)
The IdZA at IZA is a Data Service Center with its own Data Enclave and related technology, whose primary focus is (meta)data relevant for labor economics. The Enclave supports all known approaches to making data available (Ultra-thin Computing Environment, Remote Computing) and some developed in-house (JoSuA). The DSC is making metadata about German data available to a large international Fellow network and beyond, using DDI and other standards for its documentation. One of the major undertakings of the DSC is to provide DDI-based English translations of German (meta)data on a large array of datasets relevant to labor economics. The talk will go over the experiences gathered over the last three or so years, the shortcomings of DDI 2.*, the ways in which these were mended, and what we hope for in DDI 3.*. Some of the newer ambitions and undertakings of IdZA, and the context in which (meta)data and remote computing are seen as complementary, will also be mentioned.
A great deal of attention has recently been paid to promoting researcher access to statistical microdata. In this paper, we describe the NORC Data Enclave which offers a secure mechanism for data custodians (e.g., Federal statistical agencies, foundations, etc.) to provide approved researchers access to sensitive business microdata. The enclave offers two modes of access: remote and onsite. We will highlight several innovative features of the enclave. First, the Data Enclave uses a portfolio approach to provide access, whereby physical and logical security technologies are combined with statistical, educational, legal and organizational features to protect confidentiality. Second, the Data Enclave provides a platform for collaboration for geographically dispersed researchers working on an approved research project. Third, data custodians can become directly engaged with researchers in producing and providing DDI compliant metadata documentation by means of blogs and wikis during the research process. The environment extends to providing access to Stata and SAS code to promote replication of research. Finally, researchers can work with data custodians and other researchers to add new data through linkages and addition of new datasets to further build a rich, community-based database infrastructure.
2008-05-29: E2: Under the Hood: Choosing a Standard
Practical Metadata Lessons: Utilising Metadata Standards for Archiving Data at Statistics New Zealand
Euan Cochrane (Statistics New Zealand)
At Statistics New Zealand we have been developing a data archiving solution in preparation for an organisation wide strategy (The "Business Model Transformation Strategy" (BmTS)) which will redesign processes, systems and tools for managing and storing data and metadata. The data archiving solution has been developed to apply to our current data stores and to archive surveys which will not transition as part of the BmTS. To fulfil the needs of the metadata component of the archiving solution we have used two standards: the Data Documentation Initiative (DDI) and the Preservation Metadata Implementation Strategies (PREMIS). The implementation of these standards has proven both a challenge and a learning experience. This presentation will cover some of the lessons we have learnt when implementing the two standards, as separate entities and in conjunction with each other, along with some of the benefits that have come from utilizing such open and established standards.
SDMX and the DDI: Using the Right Tool for the Job
Arofan Gregory (Open Data Foundation)
This paper covers the major features of the SDMX standard, and positions it relative to DDI versions 1.*/2.* and 3.0. It describes the typical use cases for each of the standards, and how to make an informed decision about which one best fits your needs. The two standards are complementary, and the way in which they can be usefully employed in a single system is also addressed. There is an obvious overlap when dealing with multi-dimensional data, but there are many other points of alignment between the standards, and these are presented. Although DDI 3.0 was designed to be used in registry applications, it contains no specification of a standard registry, whereas SDMX does. The integration of DDI metadata into a standard SDMX registry implementation is described, to support collection, dissemination, data sourcing, and question, concept, and variable banks. Available tools and their integration are also discussed.
Using XBRL to Reengineer a Data Collection and Collaboration Process
Linda Powell (Federal Reserve Board)
In 2003, three U.S. banking regulatory agencies combined resources to revolutionize the collection, editing, storage, and dissemination of Commercial Bank Reports of Income and Condition. The regulatory agencies relied heavily on web-based technology and the XBRL transmission protocol. This paper will review the creation of an interagency data collection and dissemination facility. It will focus on the business problem that needed to be solved, the evolution of the technology that enabled the project, what XBRL is, and why it was selected as the transmission protocol. The paper will also review the challenges and benefits associated with using a standard transmission protocol versus creating a customized XML transmission facility.
If any of you have ever read the popular book by Nicholas Carr titled "Does IT Matter?", you know that he is being intentionally provocative in challenging us to redefine how we invest in Information Technology. In his book he talks about proprietary and infrastructural technology, and lumps most IT into the latter. The premise is that the true value of IT is not fully realized until it is broadly shared, homogenized and standardized. In essence, we lose our ability, or need, to differentiate ourselves, and we are at a point where innovation on an individual/institutional level will not lead to a meaningful advantage. That does not preclude this innovation; it simply states that the true value lies in sharing the innovation. Although his book is geared towards the private sector, there are many parallels to the world of Higher Education, and certainly the financial realities are prominent. Faced with inevitable budget constraints, the maturation of DDI, and advances in other technologies, the need to articulate a shared vision and act collectively becomes imperative. In this part of the panel we will talk about the commoditization of IT and how we painted a vision for shared development, resources and access with the ODESI project.
Tomorrow - Ensuring Sustainable Data and Metadata for the Future
Mary Vardigan (ICPSR)
How do we ensure that the digital assets we create, enhance, and disseminate are preserved for future generations and remain usable for research, despite rapid-paced technological change? How do we protect the investment we make in data resources over the full life course of a project and not lose information along the way? These are questions that the social science data archives and others concerned with the development of cyberinfrastructure need to answer as we look to tomorrow. This presentation will focus on the big picture in terms of recent developments in the field of digital preservation and then will narrow in scope to discuss the role of DDI in a sustainable digital preservation program; the use of DDI at ICPSR and in a new project that covers the data life cycle; and finally some challenges remaining that we collectively need to solve.
2008-05-30: F1: Implementation, Application, and Sharing of DDI Resources
Metadata Share Project (MSP)
Joel Herndon (Duke University)
Rob O'Reilly (Emory University)
While the exponential growth of web-based data sources has expanded access to the research community, this same growth has presented a series of challenges to Data Libraries that attempt to promote a mixture of online data resources alongside library-licensed resources. Even with advanced content management systems, many Data Libraries devote a great deal of effort to describing similar sets of web-based data resources for local patrons. The Metadata Share Project attempts to reduce the effort required to document the growing number of web-based data sources by sharing DDI-compliant data descriptions across research libraries. This paper describes a test initiative by Duke University's Data GIS Services and Emory University's Electronic Data Center to share library DDI resources in order to expand resources at both institutions while reducing the burden of documenting data resources in DDI. We hope to expand the discussion of sharing DDI resources across data libraries by discussing our experience.
Creating Enriched Publications with MPEG-21 DIDL, DDI 3.0 and Primary Research Data
Rob Grim (Tilburg University)
Paul Plaatsman (Erasmus Data Service Centre)
One of the challenges for both scientific researchers and research libraries in the eScience era is the creation of scientific publication packages (SPPs)[1], wherein publications are combined with the primary resources that were used for the publication, such as, for example, research data, statistical programming code or stimulus materials. At the end of 2007, the library of Tilburg University and Erasmus University Rotterdam (EUR) started an interuniversity and multidisciplinary SURF[2]-financed project called “Together in Sharing”, which aims to create SPPs for economic and social science research domains. For this project, primary research data were used from surveys (European Values Study), experimental economics (CentERlab) and finance (EUR). The project used MPEG-21 DIDL as a general data model to represent and package the digital objects in an SPP. DDI 3.0 instances are created to capture the metadata that researchers consider relevant for the enhanced publications, and as a means to build metadata records that can be harvested within a library portal environment. The production of SPPs also raises more fundamental questions, i.e. where do we store SPPs in a national infrastructure that is equipped for archiving research data separately from publications.
DDI 3.0: Final Revisions and Future Directions
Arofan Gregory (Open Data Foundation)
Wendy Thomas (Minnesota Population Center)
DDI 3.0 represented a major change from preceding versions of the standard, and was subject to several rounds of internal and public review as it developed. The last stage of review was a Candidate period of testing prior to its final release. This stage included applying the proposed standard to a variety of different real-world data sets, and its implementation in software tools. This paper summarizes the experience of the DDI Technical Implementation Committee during this final phase. It will highlight what was learned from the use cases and software implementations, in terms of both process and required adjustments in the standard, as well as discuss the anticipated future direction of the standard as it matures.
Documentation of German Labor Force Data at the IAB: First Experiences with DDI 3.0
Claudia Lehnert (Institute for Employment Research)
Joachim Wackerow (GESIS/ZUMA)
The Federal Employment Agency (BA) is one of the most important producers of administrative data about the labor market in Germany. BA data are collected in the notification process of the social security system and in BA internal procedures for computer-aided benefit allowance, job placement and the administration of employment and training measures. The preparation and documentation of these process-generated data for researchers is performed by the Institute for Employment Research (IAB). Most of these data have been documented in PDF and MS WORD formatted codebooks, with some information contained in an SQL database, but the documentation in the current structure is coming up against limiting factors. We have therefore started to establish a new database organized according to the DDI 3.0 standard. This enables us to reuse metadata for different data collections and to track changes of variable definitions and frequencies over time. Our presentation focuses on the general structure of the documentation for administrative data, especially the use of different data sources for several data sets. Our initial experiences with DDI 3.0 will be illustrated by a structural outline and selected examples.
2008-05-30: F2: Integration and Linking: Bringing Data and Documents Together
Data in DSpace: Linking Archival Primary Documents and Quantitative Datasets
Ann Marshall (University of Rochester)
This paper investigates the potential for archiving primary source documents and the datasets created from these documents in the institutional repository DSpace. Through an initiative at the University of Rochester (U.R.), the Friends of the U.R. Libraries awarded a dissertation grant in support of depositing a unique dataset into the University’s digital archive. This grant funded the acquisition and digitization of WWII military documents currently located at the Bundesarchiv in Berlin, Germany. On condition of awarding these funds, the doctoral student agreed to deposit both the digitized primary documents and the unique dataset created from these documents into DSpace. This approach has the potential to increase the awareness and use of DSpace, while also capitalizing on the contribution that doctoral students might bring to data depositories. The paper also discusses the use of DSpace technology as a data depository and considers current and future enhancements to DSpace. Issues such as metadata, the availability of funds, interface functionality, and copyright are important considerations for expanding this initiative.
Implementing a Digital Repository for the Preservation of Interdisciplinary Data
Robert R. Downs (CIESIN, Columbia University)
Robert S. Chen (CIESIN, Columbia University)
Digital scientific data created during the last few decades offer potential for analysis by future users and for integration with other data from different disciplines to support interdisciplinary analysis, discovery, decision-making, and education. However, significant barriers remain in managing and documenting such data sufficiently to meet the needs of future and interdisciplinary users. One possible approach to overcoming these barriers is to develop and implement digital repository systems within an appropriate institutional context. We report here on progress in implementing a digital repository using the Fedora open source software, working with the Columbia University Libraries. After discussing platform selection, feasibility testing, and collection development policy issues, we describe our experience with data migration and parallel ingest of data. We then discuss current system enhancements, challenges, and plans to improve capabilities for ingesting data and for enabling dissemination that supports future applications and use.
Tanja Hethey (Research Data Centre of the German Federal Employment Agency at the Institute for Employment Research)
Anja Spengler (Research Data Centre of the German Federal Employment Agency at the Institute for Employment Research)
In Germany, process-generated data and survey data on firms are collected by different data producers. Each data producer provides access for researchers to its data, but the combination of datasets from different producers is not possible at the moment. The KombiFiD project aims to overcome this limitation: firm data collected by the German Statistical Offices, the German Central Bank and the Federal Employment Agency will be merged for the first time. Our goals are twofold: to gauge the possibilities of merging selected datasets beyond the limits of individual labour market data producers, and to provide combined datasets to science, thereby creating new research opportunities. Our presentation outlines the status quo of the project. We describe the datasets selected for merging and explain potential merging problems. Moreover, we address research questions which can be analyzed for the first time with the unique new data. Among other things, this includes the possibility of tracing the history of businesses on the basis of their combined individual units.
We Inhabit the Same World: Integrating Socio-economic and Environmental Data
Dr. Veerle Van den Eynden (UKDA)
The Rural Economy and Land Use programme provides examples of how interdisciplinary research projects carried out by teams of social and natural scientists combine the use of socio-economic and environmental data. Data may be integrated through spatial integration (GIS), modelling, relational databases or data conceptualisations and visualisations. Integrations must take into account differences in data scale, area and framework. Whilst social science data are usually organised according to administrative areas, natural science data are based on grids or ecological zoning. Researchers use different approaches to optimise communication between such diverging data. Experiences of data integration also provide information on how best to organise and archive data to enable their long-term use within various research disciplines.
2008-05-30: F3: The Challenges of Data Preservation
Aligning Digital Preservation Policies with Community Standards
Nancy McGovern (ICPSR)
Digital preservation policies are an essential component of an organization's digital preservation program. Yet, recent surveys show that many organizations that manage digital content do not have an explicit policy statement to delineate the mandate, purpose, scope, principles, and objectives of their digital preservation program. The Digital Preservation Management workshop series developed at Cornell University by Anne R. Kenney and Nancy Y. McGovern (Digital Preservation Officer at ICPSR) between 2003 and 2006 produced version 1.0 of a digital preservation policy framework. The resulting framework is a high-level policy document that is structured to be sharable with other organizations. Version 2.0 of the digital preservation policy framework was developed and tested at ICPSR. It builds on the Cornell model by aligning the components of the framework with the attributes of a trusted digital repository and incorporating key components of the Open Archival Information System (OAIS) Reference Model. This paper will discuss the digital preservation policy framework, present examples from the version 1.0 and version 2.0 models, discuss the structure and development of a comprehensive set of digital preservation policies for an organization, consider the connections between recent research and development on policy engines for digital preservation, and propose next steps for community policy development.
Challenges in Preserving Neuroimaging Research Data
Angus Whyte (University of Edinburgh, Digital Curation Centre)
Preserving neuroimaging research data for sharing and re-use involves practical challenges for those concerned in its use and curation. These are exemplified in a case study of a psychiatry research group. The study is one of a series encompassing two aims: firstly, to discover more about disciplinary approaches and attitudes to digital curation through “immersion” in selected cases; secondly, to apply known good practice, and where possible to identify new lessons from practice in the selected discipline areas. These aims were addressed through ethnographic study of current practices, and action research to assess risks, challenges, and opportunities for change. The challenges are in some ways archetypal of fields that are embracing “e-science”: how to reconfigure practice to improve data sharing and re-use, given the capabilities afforded by “cyberinfrastructure.” The evolution of these practices in neuroimaging is tied to the social and technological infrastructure underpinning the domain and imaging centres such as the psychiatric research group in question. Its preservation challenges may be understood by examining relationships between its history, the nature of the data collected, innovations in analysis, practices of sharing data and methods, and the evolution of data repositories in the domain.
Planning Against Failure – It's Not All about Technology
Dr. Lucia Lotter (Human Sciences Research Council)
Marie-Louise van Wyk (Human Sciences Research Council)
This presentation will illustrate that a successful data curation solution can be implemented without an excessive investment in technology and resources. While more than seventy percent of information technology projects fail, only five percent of such failures can be attributed directly to technology. It is thus essential to understand the factors that contribute to these failures and to ensure that preventative measures are put in place. The presentation will, by raising selected issues, address the implementation of data curation in a research organisation. It will highlight challenges relating to (1) information technology methodology, (2) executive custody, (3) strategic alignment, (4) funding / resources, (5) data engineering and data management, (6) people / soft issues and (7) technology issues. There will be discussion of practical ways of dealing with obstacles, as well as illustrations of the supportive role that technology can play in the implementation process.
Preserving Social Science Data: How Much Replication Do We Need?
Myron P. Gutmann (ICPSR)
Nancy Y. McGovern (ICPSR)
Bryan Beecher (ICPSR)
T.E. Raghunathan (University of Michigan)
Those responsible for digital preservation are aware of a tension between the need to expend resources on preservation and the scarcity of those resources. Ideal preservation would save many copies forever, but this has a large potential cost. We need to be certain that we are preserving the right number of replicas. The paper raises issues that derive from a core attribute of most social science data, which is that social science data is often created by drawing random samples from a population and studying the behavior or attributes of the sample. The sampled character of these data has implications for preservation. While it is less than desirable to lose cases from a sample, even after some loss the sample still has validity and can be used for future research. From this the paper argues that replication for preservation purposes may require thinking at the level of cases or variables and not entire data files. There may be varying numbers of replicas within a data file, depending on the attributes of the overall sample, and the attributes of cases and variables. The situation is also more complex because of the need to protect confidentiality of data.
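One way to make the argument about case loss concrete: for a simple random sample, losing a random fraction of cases leaves estimates unbiased and only inflates standard errors. A back-of-the-envelope illustration, assuming simple random sampling and purely random loss:

```latex
% If a fraction p of the n original cases is lost at random, the mean of the
% surviving cases remains unbiased, and its standard error grows only by a
% factor of 1/sqrt(1-p):
\[
  \mathrm{SE}_{\text{after loss}}
    = \frac{\sigma}{\sqrt{n(1-p)}}
    = \frac{1}{\sqrt{1-p}}\,\mathrm{SE}_{\text{original}}
\]
% e.g. losing 10% of cases (p = 0.1) inflates the standard error by only about 5%.
```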
2008-05-30: G1: Innovation in the Use of International Data for Teaching and Learning
Measuring Development Results: The Story Behind the Numbers
Eric Swanson (World Bank)
The World Bank's World Development Indicators database provides access to more than 800 statistical series for 209 economies. A comprehensive, consistent global database is essential to building a culture of evidence-based decision making and increasing the effectiveness of development programs. In this talk I will discuss the underlying sources of data for the WDI and efforts to improve the availability and reliability of data produced by developing countries. These data are being incorporated into results measurement systems to encourage greater accountability on the part of aid donors and recipients.
Organisation for Economic Cooperation and Development
Joachim Doll (Organisation for Economic Cooperation and Development)
Recent teaching and learning developments at the OECD include the organization’s three new data dissemination services, a new data warehousing system, new visualization interfaces, and a tentative exploration of Web 2.0 technologies. This talk will outline these recent developments designed to address the needs of a wide variety of new audiences.
Learning and Teaching with the ESDS International Data Service
Jackie Carter (Mimas, University of Manchester)
ESDS International is a UK-wide national data service providing free web-based access to regularly updated international databanks produced by intergovernmental agencies. We also help users locate and acquire international survey data and provide a helpdesk, support materials, learning and teaching resources, introductory awareness-raising courses, and interactive visualisation interfaces. This presentation will focus on two key learning and teaching resources available through ESDS International. The Countries and Citizens e-learning materials are a comprehensive course on combining international aggregate and survey data. Written by subject specialists, the materials include PowerPoint slides, PDF documents and streamed video files. The second resource to be described is the set of e-learning materials based around the UN’s Millennium Development Goals (MDGs). These materials are currently in development, with an anticipated release date later this year. They are intended to provide a useful, effective and re-usable e-learning package on the MDGs which will be available to the entire international data community.
IRB Issues and Archival Data: From Data Deposit to Data Use
Amy Pienta (ICPSR)
IRB issues as they relate to archival data are quite wide-ranging. This presentation will describe the importance of understanding the IRB process with respect to archiving and use of secondary research data. IRBs often provide oversight of data archiving plans, and informed consent statements may prohibit public archiving of data. ICPSR will present examples of informed consent statements that should and should not be used when a researcher intends to provide long-term access to data through a data archive. With respect to use of secondary data, some IRBs in the U.S. recognize publicly archived data as exempt from IRB review when secondary analysis is proposed. The University of Michigan recently instituted a process of adding public-use data to a list of pre-approved data that do not need IRB clearance prior to analysis. These examples will be discussed in this presentation.
The Digital Locked File Cabinet: A Problem of Metaphor
Thomas Lindsay (University of Minnesota)
Kristen Houlton (University of Minnesota)
The standard of respondent data security for researchers has long been the locked file cabinet. A simple concept that all researchers understand and adhere to, it continues to be the metaphor for data security in the digital age. But although the locked file cabinet itself is a simple concept, it becomes incredibly difficult to interpret when used as the standard for digital data protection. With recent data privacy laws, each institution's IRB has been given the task of determining on a case-by-case basis whether a researcher's data security plan adheres to this metaphor. Although IRBs are supposed to be both policy and enforcement bodies located in each institution, they are often underfunded, overworked, and ill-equipped to deal with the technical complexities of the research presented. IRB proposals are often approved or rejected inconsistently, based on non-technical issues or incorrect understandings. So it falls to data security professionals in each institution to work with the IRB to develop standards for their operations that meet these goals. This talk will look at the many stakeholders and address the needs, desires, and obligations of each, and will explain how CLA Survey Services at the University of Minnesota developed a standard for interpreting the digital metaphor of the locked file cabinet. We will address the important but delicate position we have found ourselves in as holders of the data and as occasional intermediaries between researchers and the IRB.
Becoming a Legitimate Data Repository: When Policy and Practice Collide
Libbie Stephenson (UCLA)
The data archive at UCLA has been operating since 1977 and disseminates publicly available data. Federal guidelines on human subjects’ protection are interpreted at UCLA to require each user of one of the public data files in the collection to file for a research review or to become certified exempt from review. Actual practices of researchers and enforcement of the Federal guidelines are at odds with this requirement. This presentation will discuss the process in which the data archive applied to become a repository of public data to bring actual practice in line with the campus interpretation of Federal guidelines for protection of human subjects. As part of the process a resource for researchers on Data Sharing and Responsible Use was developed. The resource and its attributes will be shared.
2008-05-30: G3: Beyond Numbers: Preserving and Delivering Non-numeric Collections
Shakespeare 2.0 - New Challenges in Preservation
John Venecek (University of Central Florida)
Elizabeth Konzak (Hoover Institution)
This paper will discuss the preservation challenges faced at the conclusion of a semester-long study to determine how effectively wikis can facilitate collaborative research among undergraduate learners. The study was conducted at the University of Central Florida in Dr. Katherine Giglio's fall 2007 Shakespeare course, which focused primarily on the social identities that pertain to Shakespeare's life, work and times. In small groups, students collaboratively constructed wikis based on specific identities such as women, men, knights, fools, lovers and villains; the wikis were designed to serve as research guides incorporating a wide range of primary and secondary source material. Once the project was complete, however, issues related to preservation and the intellectual property rights of our students became primary concerns. Our presentation will focus specifically on how current trends in Web 2.0 and open-source publishing as vehicles for collaboration will affect projects such as ours, with an eye toward intellectual property rights as well as methods of appraisal and preservation in this highly collaborative environment of ever-changing technologies.
Sounding It Out: Sharing and Disseminating Audio-Visual Data
Mus Ahmet and Louise Corti (UKDA)
Increasingly, data archives are confronting new kinds of media. Qualitative data is traditionally captured as an audio source, and increasingly visually, yet sharing that data can be problematic. Format standards, storage capacity and consent to share have been the main barriers. The UKDA currently delivers audio in MP3 format via authenticated web-based downloads, but has been investigating enhanced delivery solutions and long-term open storage formats. The goal is to deliver a flexible open-source streaming media format with rich metadata content. The precise description and representation of audio-visual data is also a challenge. The UKDA has been working on metadata schemas for complex multi-media collections, including METS and schemas for relating text to audio-visual material. Item-level description for complex multi-media collections adds power to the data. The UKDA has been leading some exciting developments in minimising data storage requirements whilst simultaneously maximising audio quality, and in representing audio-visual sources using hypermedia and FEDORA-based systems. This paper will provide an overview and discussion of these developments and consider future best practice and guidance.
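The abstract does not name specific tools, but one way to picture the move from MP3 delivery toward open storage and streaming formats is a batch transcode of archived masters into Ogg Vorbis with a command-line encoder such as ffmpeg. The sketch below is illustrative only; the directory names and quality setting are assumptions, not UKDA practice.

```python
import subprocess
from pathlib import Path

# Illustrative only: batch-transcode WAV masters to Ogg Vorbis, an open
# format suited to streaming delivery. Paths and the mid-range VBR quality
# setting (-qscale:a 5) are assumptions, not a documented UKDA workflow.
SOURCE_DIR = Path("masters")      # hypothetical directory of WAV masters
TARGET_DIR = Path("streaming")    # hypothetical delivery directory
TARGET_DIR.mkdir(exist_ok=True)

for wav in sorted(SOURCE_DIR.glob("*.wav")):
    ogg = TARGET_DIR / (wav.stem + ".ogg")
    subprocess.run(
        ["ffmpeg", "-i", str(wav), "-codec:a", "libvorbis", "-qscale:a", "5", str(ogg)],
        check=True,
    )
```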
Two New Content Services on the 1956 Institute Portal
Zoltan Lux (1956 Institute)
The 50th anniversary of the 1956 Hungarian Revolution encouraged many new written and electronic works. We have now created a database of these works (books, films, websites and conferences) and of reviews of them (www.rev.hu/reviews). The photographic documentation database has taken a further step forward with the project entitled “Processing of Photographic Life’s Works”. Our plan is to process the work of four to six living photographers each year. The project includes recording a life interview with each photographer, which can be read and heard through the internet service. The photographs selected to present each life’s work can be ordered for various uses or even commented upon online. The content service will also be accessible in English. (Under development: www.rev.hu/photographers.) Both projects will be developed further. The subject area of the “Review” database will be broadened, with the eventual intention of having it cover historical publications on contemporary Hungarian history (if possible in two languages). This is more a problem of content development. We would also like to adapt the framework system of the “Photographic Life’s Works” database to our earlier photo-documentary database (http://server2001.rev.hu/oha/index_e.html).
2008-05-30: H1: The CESSDA ESFRI Project - Setting Up a One-Stop Shop for European Data
The Vision and Requirements - Legal, Financial, Governance
Kevin Schurer (UK Data Archive)
The principal outcomes of achieving these objectives will be the establishment of CESSDA as a legally constituted entity, membership of which will mark organisations as quality-assured centres of expertise in the preservation, management and dissemination of research resources. The undertaking will also position CESSDA as the first choice for the deposit of data by both national and pan-European bodies, for example Eurostat and DG Research. In line with the recently published NSF (National Science Foundation) report on cyberinfrastructures, the broad vision of the CESSDA RI upgrade can be stated as wishing to develop ‘a system of …. data collections that is open, extensible, and evolvable; and to support development of a new generation of tools and services for data discovery, integration, visualization, analysis and preservation…[consisting of]…. a range of data collections and managing organizations, networked together in a flexible technical architecture using standard, open protocols and interfaces, and designed to contribute to the emerging global information commons.’
The Infrastructure - Availability, Authentication and Access
Atle Alvheim (NSD)
Vigdis Kvalheim (NSD)
The project will undertake strategic work to plan functionality enhancements in data access and data exploration. It will also develop functional specifications for a Shibboleth-based single registration and sign-on system for data users, incorporating a common CESSDA-wide end-user licence and logging of data use for statistical reporting and evaluation purposes, as well as potential solutions to the problem of a common identifier system for data resources, with version identification for version control purposes. However, all this depends on thorough procedural groundwork. Europe still consists of some 30+ independent countries, each with national legislation protecting data and privacy. The project aims to build an infrastructure of content, and the paper will discuss this as a two-fold problem of availability and access, both technical and procedural (legal and administrative), to ensure access and use. Who can access what, and for which purposes, is still a very relevant question.
The Services - Data Discovery, Harmonisation, Analysis and Dissemination
Uwe Jensen (GESIS-ZA)
The project will set up strategic plans for developments in metadata, data models and software upgrades for data and metadata capture, processing and management, which will complement the guiding services in online publishing, discovery, access and analysis of complex data types. Particular attention will be given to the advanced options of DDI 3.0 for handling complex datasets along the entire data life-cycle. The extension of the thesaurus to other languages will enable natural-language resource discovery for an increased number of European researchers. Corresponding developments to the CESSDA portal will enhance the system for researchers by facilitating access to resources discovered via the thesaurus. Needs in comparative research associated with data harmonisation are the subject of a particular endeavour. Identifying the demands for harmonisation standards, together with conceptual work on metadata structures for harmonisation rules and conversion keys, are the central issues on the content side. The draft of a database and the construction of middleware to facilitate the application of harmonisation techniques through the portal are key tasks in setting out the functional specifications and realising the related technical solutions.
The Staff - Professionalisation, Network of Excellence, Training
Adrian Dusa (RODA)
The project will address the skills gap between the better-developed archives and those which are less well resourced. It will establish a working group, undertake an audit of expertise, organise a workshop for CESSDA members and draft a report based on the results of the audit and the workshop. The aim is to extend the CESSDA network in terms of membership, professionalism and skills, and in terms of associations with similar organisations. The precise differences between CESSDA archives will be examined to identify the gaps (both technical and organisational) that will need to be filled to bring all organisations (both existing members and those which may seek to join at a later stage) to a common standard. This work will be complemented by the development of a self-assessment tool which will enable all archives to measure their conformance with the OAIS Reference Model. The combined effect of this work will be to raise standards within the community.
The Widening – Exploring Potentials and Possibilities of Extending the Research Infrastructure
Brigitte Hausstein (GESIS-SAEE)
The project will focus on the need to widen participation in the CESSDA RI, both directly by fostering membership in new countries and indirectly by deepening involvement and extending the CESSDA RI to agencies and organizations which remain outside CESSDA yet host important data collections. Although the current CESSDA network is extensive, including 21 countries, it is not fully comprehensive. Equally, the CESSDA network is currently rather heterogeneous, with some country members being younger and less developed. These imbalances will be addressed. A special action plan will be set up to extend the existing CESSDA network and foster the development of national data archiving initiatives in countries which are not currently part of CESSDA. The purpose of this activity is to spread the CESSDA network to every EU member state and to create and maintain a ‘complete’ pan-European SSH network, including representation from emerging and candidate countries. Equally, some countries with organizations within CESSDA have specialised, research-led, project-based teams whose data currently fall outside the CESSDA network, and mechanisms likewise need to be put in place to improve comprehensiveness and coherence.
2008-05-30: H2: New Data, New Tools: the State of Software Development at the Minnesota Population Center
A Unified System for Processing Microdata Projects with Disparate Hierarchical Data Models
Justin Coyne (University of Minnesota)
Historically, the Minnesota Population Center (MPC) has dealt with 2-tier (person-household) hierarchical datasets that originate primarily from census data. Recent projects have data hierarchies that go well beyond the limits of the systems developed to process the 2-tier data. This has necessitated development of a new system able to handle data structured in a multi-tier configuration. We contrast the structure of two datasets (the American Time Use Survey and the Integrated Health Interview Survey) and explore how our unified system addresses the unique challenges posed by each dataset. We will examine the techniques we used to identify and classify hierarchies, translate flat datasets into a relational database, and create queries against the relational data model. We offer a brief overview of the four-stage pipeline we use to process data from collection through integration to extraction back into a flat data file.
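The abstract does not give implementation details, but the general pattern of translating a 2-tier person-household file into relational tables and querying across the hierarchy can be illustrated with a minimal sketch; the table and column names below are hypothetical, not the MPC's actual schema.

```python
import sqlite3

# Hypothetical two-tier microdata: household records with nested person records.
households = [
    {"serial": 1, "region": "NE", "persons": [{"pernum": 1, "age": 34}, {"pernum": 2, "age": 3}]},
    {"serial": 2, "region": "SW", "persons": [{"pernum": 1, "age": 71}]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE household (serial INTEGER PRIMARY KEY, region TEXT)")
conn.execute("""CREATE TABLE person (
    serial INTEGER REFERENCES household(serial),
    pernum INTEGER,
    age INTEGER,
    PRIMARY KEY (serial, pernum))""")

# Load the hierarchy into the two tables.
for hh in households:
    conn.execute("INSERT INTO household VALUES (?, ?)", (hh["serial"], hh["region"]))
    for p in hh["persons"]:
        conn.execute("INSERT INTO person VALUES (?, ?, ?)", (hh["serial"], p["pernum"], p["age"]))

# A query that crosses tiers: each person together with their household's region.
for row in conn.execute(
    "SELECT p.serial, p.pernum, p.age, h.region "
    "FROM person p JOIN household h ON p.serial = h.serial"
):
    print(row)
```

A multi-tier dataset extends the same idea with additional child tables (for example, activity spells within persons), each keyed back to its parent record.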
Domain Specific Languages for Data Editing
Colin Davis (University of Minnesota)
The Minnesota Population Center (MPC) offers microdata with constructed variables not found in the original datasets. These variables represent an important addition to the data disseminated by the MPC. However, recent MPC projects have involved datasets with more complex structure, necessitating the development of new tools to import, integrate and disseminate these datasets. The new toolset performs variable construction and complex data editing as the final phase of variable integration. The tools accomplish this by executing scripts that apply edits to variables. The editing scripts are meant to be easy for a researcher without a programming background to read and modify, and are written utilizing a domain-specific language, or DSL. The MPC has utilized a rudimentary form of a DSL for the data editing of US Census micro-data. For datasets with more richly structured data, a similar but more expressive and powerful domain-specific language was created. We examine the two styles of the data editing DSL and look at the challenges posed by the form of the American Time Use Series data and the Integrated Health Interview Survey data compared with traditional census data.
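The MPC's editing language is not specified in the abstract, but the flavour of a data-editing DSL that a non-programmer could read can be suggested with a minimal sketch: simple "if condition then set variable" rules interpreted against a data record. The rule syntax and variable names here are hypothetical.

```python
import re

# A hypothetical, minimal editing-rule syntax:
#   if <VAR> <op> <value> then set <VAR> to <value>
RULE = re.compile(r"if (\w+) (<=|>=|==|<|>) (\S+) then set (\w+) to (\S+)")

OPS = {
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "==": lambda a, b: a == b,
}

def apply_edits(record, script):
    """Apply each rule in the script, in order, to a single data record (a dict)."""
    for line in script.strip().splitlines():
        var, op, value, target, new_value = RULE.match(line.strip()).groups()
        if OPS[op](record[var], float(value)):
            record[target] = float(new_value)
    return record

# Example: recode implausible ages to a missing-value code (9999).
script = """
if AGE < 0 then set AGE to 9999
if AGE > 120 then set AGE to 9999
"""
print(apply_edits({"AGE": -3.0, "INCOME": 42000.0}, script))
```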
Building an Extensible Data Access System for Longitudinal Surveys
Marcus Peterson (University of Minnesota)
Though traditionally dealing in simply structured microdata, the Minnesota Population Center (MPC) has recently begun to move toward handling more complex types of demographic data. One such project involves the harmonization and dissemination of the National Survey of Families and Households (NSFH), a longitudinal dataset containing some 13,000 respondents surveyed over 16 years. In developing a database-driven, web-based access system for this dataset, the MPC aims to build an interface capable of disseminating a broad range of other similarly complex datasets. The NSFH dataset introduced several new programming challenges for the MPC, most relating to the presentation and storage of longitudinal data comprising more than 27,000 variables. Initially working with an assorted collection of codebooks, DDI and data files, the MPC developed tools to import the NSFH metadata and data into a database that is easily queried from a web application. By making this disparate data accessible through a database, the MPC has taken a big step toward realizing an access system suitable for this and other equally complex datasets. Here we will discuss the building of the data access portion of the NSFH dissemination system.
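One way to picture metadata made "easily queried from a web application" is a variable-level table keyed by variable name and survey wave, searched by keyword the way a discovery interface might. The schema and example variables below are assumptions for illustration, not the NSFH system's actual design.

```python
import sqlite3

# Hypothetical variable-level metadata for a longitudinal survey:
# one row per variable per wave.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE variable (
    name TEXT, wave INTEGER, label TEXT,
    PRIMARY KEY (name, wave))""")
conn.executemany(
    "INSERT INTO variable VALUES (?, ?, ?)",
    [
        ("MARST", 1, "Marital status at wave 1"),
        ("MARST", 2, "Marital status at wave 2"),
        ("HHINC", 1, "Total household income, wave 1"),
    ],
)

def search_variables(keyword):
    """Keyword search over variable labels, ordered by name and wave."""
    rows = conn.execute(
        "SELECT name, wave, label FROM variable "
        "WHERE label LIKE ? ORDER BY name, wave",
        (f"%{keyword}%",),
    )
    return list(rows)

print(search_variables("Marital"))
```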
Time Well Spent: Building a System for Time Use Research
Benjamin Ortega (University of Minnesota)
The Minnesota Population Center is working with the University of Maryland's Joint Program in Survey Methodology to develop a data access system for the American Time Use Survey, a collection of time diary information from participants in the U.S. Census Bureau's Current Population Survey (CPS). This unique dataset offers potential for research on topics including work and family time, social policy impact, and many others, but the data's inherent complexity has limited its use so far to a small group of researchers. With the aim of facilitating research use of this data, we are building a system that combines comprehensive metadata and documentation with a tool that lets researchers work intuitively with time use and survey data. Our system enables researchers to specify customized aggregations of time spent involving particular activities, locations, times of day, and other criteria. Users can then view this data alongside CPS responses. In this talk, we will discuss our efforts to create the end-user component of this system, drawing on the MPC's previous experience building such tools while incorporating traditional and emerging trends in software architecture and usability to address the new challenges presented by the disparate qualities of this data.
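The kind of "customized aggregation" described above can be sketched as summing diary minutes per respondent for user-selected activities, optionally restricted by location or other episode attributes; the resulting totals can then be merged with CPS responses. The episode layout and category labels below are hypothetical.

```python
from collections import defaultdict

# Hypothetical time-diary episodes: one row per activity spell per respondent.
episodes = [
    {"caseid": 101, "activity": "childcare", "location": "home", "minutes": 90},
    {"caseid": 101, "activity": "paid work", "location": "workplace", "minutes": 480},
    {"caseid": 102, "activity": "childcare", "location": "home", "minutes": 45},
    {"caseid": 102, "activity": "childcare", "location": "park", "minutes": 60},
]

def aggregate_time(episodes, activities, locations=None):
    """Total minutes per respondent spent on the selected activities,
    optionally restricted to a set of locations."""
    totals = defaultdict(int)
    for ep in episodes:
        if ep["activity"] not in activities:
            continue
        if locations is not None and ep["location"] not in locations:
            continue
        totals[ep["caseid"]] += ep["minutes"]
    return dict(totals)

# Childcare time anywhere, per respondent -- ready to merge with CPS responses.
print(aggregate_time(episodes, {"childcare"}))
# Childcare time at home only.
print(aggregate_time(episodes, {"childcare"}, locations={"home"}))
```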
2008-05-30: H3: Establishing Data Archives in Developing Countries: Some Initial Steps
The Accelerated Data Program (ADP) in Latin American Countries