W4: Building a Data Library or Data Observatory on the Web Using Nesstar Technology
Jostein Ryssevik (Nesstar, Ltd.)
Margaret Ward (Nesstar, Ltd.)
Cliff Dive (Nesstar, Ltd.)
This workshop will focus on the various steps in the process of setting up and running a data library or observatory on the Web. Issues to be addressed include: a) setting up and configuring a Nesstar server, b) customizing the interface, c) adding data and other digital resources, d) defining and managing access control and e) building a virtual library by linking several servers. The workshop will be of a hands-on nature and will include visits to a variety of live data services. Participants will also set up, customize and populate their own server during the workshop.
W5: Data Publishing with Nesstar Publisher
Margaret Ward (Nesstar, Ltd.)
Jostein Ryssevik (Nesstar, Ltd.)
Cliff Dive (Nesstar, Ltd.)
This workshop will focus on data and metadata preparation and publishing by means of Nesstar Publisher. Nesstar Publisher is a comprehensive data management tool that allows the user to extract data from a variety of formats and systems, author and edit DDI-compliant metadata, and load data and metadata onto a Nesstar server (or any other DDI-compliant data platform). The focus will be on survey data (including complex hierarchical studies) as well as on aggregated data or cubes. Publishing of maps (GIS resources), reports and other documents will also be covered. The workshop is a follow-up to Workshop 4, but can also be useful for persons without prior knowledge of Nesstar. The workshop will be hands-on and participants are encouraged to bring their own data.
W6: Using Streaming Geospatial Data Sources
Steve Morris (North Carolina State University)
Guy McGarva (EDINA, University of Edinburgh)
James Reid (EDINA, University of Edinburgh)
In the past few years new streaming geospatial data sources have become available, allowing users and their applications to interact with remote geospatial data resources and services. These services are based on proprietary technologies such as ESRI’s ‘image server’ and ‘feature server’ as well as on open technologies such as the Open Geospatial Consortium specifications. The most common examples of the latter are the WMS (‘Web Map Service’) and WFS (‘Web Feature Service’) specifications, although other specifications such as the Web Coverage Service (WCS) and the Catalog Interface (CAT) are being published and used. This workshop will focus on consumption of such data sources and services, with an eye to integrating these new resources with more traditional file-based data offerings. Topics to be addressed include: demystifying the alphabet soup of WMS, WFS, WCS, GML, etc.; identifying and evaluating some existing streaming data sources; discussing the advantages and pitfalls of using streaming data in project work and research; and highlighting challenges related to integration of streaming data with traditional file-based data in catalogs and metadata databases. The discussion will include hands-on examination of some existing streaming geospatial data services. While the workshop will primarily focus on consumption of such services, a brief overview of approaches to publishing streaming data, and the issues arising from doing so, will also be provided. Also to be considered is the challenge posed to data preservation by the elimination of data file acquisition as a necessary precursor to providing data access.
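To give a flavour of what consuming such a service involves, the sketch below builds a WMS GetMap request against a hypothetical server. The parameter names come from the OGC WMS 1.1.1 specification, but the endpoint URL, layer name and bounding box are invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Hypothetical WMS endpoint; any OGC-compliant server follows the same pattern.
base_url = "http://example.org/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "topography",        # a layer advertised by the GetCapabilities response
    "STYLES": "",                  # default styling
    "SRS": "EPSG:4326",            # spatial reference system (lat/lon)
    "BBOX": "-8.5,49.8,2.0,60.9",  # minx,miny,maxx,maxy -- roughly the British Isles
    "WIDTH": "512",
    "HEIGHT": "512",
    "FORMAT": "image/png",
}

# The server renders the map on demand; no data file is ever downloaded or archived.
urlretrieve(base_url + "?" + urlencode(params), "map.png")
```

The same request-based pattern underlies WFS, which returns GML features rather than rendered images; it is precisely this absence of a single acquirable file that creates the preservation challenge the workshop raises.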
W7: DDI 102: Codebook Creation and Beyond
William Block (Minnesota Population Center)
Mary Vardigan (ICPSR)
This workshop is a follow-up to last year's successful introductory workshop on the DDI. This workshop will begin with a brief introduction to DDI and XML and then move on to a hands-on exercise in which participants create DDI codebooks from actual documents they have brought from their local settings. The bulk of the session will be hands-on entry by participants, resulting in a DDI-compliant file that can be taken home. Freely-roaming instructors will be available throughout the hands-on portion of the workshop to answer questions and offer advice, and DDI questions at any level of expertise are encouraged! The workshop will conclude with suggestions on how to mark up documents more efficiently as well as a look toward what is coming with DDI 3.0. You might consider taking this workshop if: you are new to the DDI and would benefit from an overview of the basics; you have questions about how to mark up documentation from your local collection; you have questions about how to mark up documents more efficiently; or you are wondering where the DDI is going with version 3.0 and what that means for DDI documentation created in an earlier version. For this session, participants should bring a codebook file in MS Word or ASCII on a CD or USB flash drive. Participants are also encouraged to bring a paper copy of their codebook, in case of technical difficulties on the morning of the workshop.
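For readers unfamiliar with what DDI markup looks like in practice, here is a minimal sketch that generates a fragment of a DDI Codebook (the 2.x-era standard) document using Python's standard library. The element names (codeBook, stdyDscr, dataDscr, var, catgry, etc.) come from the DDI Codebook specification; the study title and variable content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Study-level description: at minimum, a citation with a title.
codeBook = ET.Element("codeBook")
stdyDscr = ET.SubElement(codeBook, "stdyDscr")
citation = ET.SubElement(stdyDscr, "citation")
titlStmt = ET.SubElement(citation, "titlStmt")
ET.SubElement(titlStmt, "titl").text = "Example Household Survey, 2004"

# Data-level description: one variable with a label and its categories.
dataDscr = ET.SubElement(codeBook, "dataDscr")
var = ET.SubElement(dataDscr, "var", name="V1")
ET.SubElement(var, "labl").text = "Respondent sex"
for value, label in [("1", "Male"), ("2", "Female")]:
    catgry = ET.SubElement(var, "catgry")
    ET.SubElement(catgry, "catValu").text = value
    ET.SubElement(catgry, "labl").text = label

print(ET.tostring(codeBook, encoding="unicode"))
```

In the workshop itself, markup is typically produced with dedicated DDI tools rather than hand-rolled scripts; the sketch is only meant to show the shape of the XML being created.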
2005-05-25: Plenary I
The Need for Rigour and Accessibility in Comparative Research
Roger Jowell (Professor of Sociology and Director of the Centre for Comparative Social Surveys, City University)
The European Social Survey (ESS) was set up in 2001 as a new quantitative time series. Its twin aims were to monitor trends in social values and to raise methodological standards. It is now about to start its 3rd biennial round, with funding from the European Commission, the European Science Foundation and 25 national science foundations across Europe. Its results are made available online as soon as they are ready, with no privileged access. Roger Jowell, the Coordinator of the ESS, will describe the origins and methodology of the project, commenting on what he sees as the 'imperatives' of modern-day cross-national time series.
2005-05-25: A1: Cross-national Socio-economic Data: Boundaries of Evidence
Understanding the United Nations Millennium Development Goals indicators -- how to find and interpret the evidence on target achievement
Robert Johnston (United Nations Statistics Division)
The eight Millennium Development Goals and eighteen targets were adopted by consensus at the United Nations Millennium Summit of the General Assembly in 2000. In just a few years, they have become the strategic centerpiece of international development assistance and support to the world's poor countries, and the measure of differences between the poor and the rich. The goals and targets were specified to a considerable extent in quantifiable terms and reflected recommendations from the previous decade of global conferences on environment, status of women, social development, population and development, education for all and international finance, among others. In adopting the Declaration the General Assembly also asked the Secretary-General to set up an evidence-based system to monitor progress towards achievement of the goals and targets at national, regional and international levels, through an annual global report and an extensive network of national reports. At the regional and international levels, these and related reports are based on data in the United Nations Statistics Division's Millennium Indicators Database at http://millenniumindicators.un.org. The presentation will take a few of these indicators -- such as extreme poverty, gender parity in school, maternal mortality, HIV/AIDS and atmospheric CO2 -- and demonstrate, preferably online if a connection is available, how the data can be retrieved from the database and interpreted. The emphasis will be, through specific examples, on how to get maximum information from limited data without taking the data beyond the serious limitations of reliability and availability typical of international statistics. Researchers should approach the data with a healthy skepticism and be prepared to look further to understand their sources, methods and limitations. The more they know about the data, the better they can put together their research strategy and the better and more reliable their hypotheses and conclusions will be.
Cross-National and Intergovernmental Data: Paying for one-stop shopping
Bobray Bordelon (Princeton University)
Governmental and intergovernmental data can be found in many formats. The interface provided by the originating body is not always user-friendly. Researchers often seek one-stop shopping and want to rely on a single interface that combines data from many organizations. Some institutions have developed local solutions while others rely upon commercial vendors such as Datastream, Global Insight, the Economist Intelligence Unit, or Bloomberg. This paper will explore the various approaches taken by commercial vendors to provide coverage, comparability, methodology, and access.
The production and presentation of statistics of unemployment: comparability issues
John Adams (Napier University)
Ray Thomas (Open University)
The United Nations publish unemployment statistics for 123 countries. Most of these statistics are based on International Labour Office (ILO) criteria for the definition of unemployment, but many countries also produce unemployment statistics based on insurance records and on the basis of registered unemployment. The paper aims to compare the main features of different series. The dimensions compared include the conceptual basis for the definition of unemployment, use of denominators for production of unemployment rates, boundaries with employment and inactivity, entry statistics and duration of unemployment, and the cultural influence of the statistics. The paper identifies conflicts between achieving international comparability and national needs. Survey statistics that underpin international comparisons do not support geographically detailed analysis within countries. The value of ILO statistics also is limited by failure to recognise the concept of entry to unemployment and difficulties of integration with other unemployment statistics. The standard LFS questionnaire could be modified to support the production of statistics for entrants to unemployment. The sampling frame could be modified to ensure consistency with nationally produced insurance or registered unemployment statistics. (The research for this paper has been supported by a grant to John Adams from Scotecon and by the award of a Campion Fellowship by the Royal Statistical Society to Ray Thomas)
The World on a plate: building and supporting a new community of international data users
Keith Cole (ESDS International, University of Manchester)
The increasing importance of cross-national research has resulted in an astonishing growth in the usage of international data over recent years. ESDS International provides academics across the UK with access to the major international databanks produced by international governmental agencies such as the United Nations, International Monetary Fund and World Bank. In this paper we describe how an emerging new user community for these databanks has been developed and supported, and the challenges of responding to a diverse and expanding user group. We report some of the problems users face accessing and using the data in their teaching and research and how these barriers can be overcome. Finally, we examine how the rapid growth in usage in the UK is reflected in current global trends and the potential demand for access to these databanks worldwide.
2005-05-25: A2: National Initiatives in Coordinating Preservation: Working Together
Data-PASS/NDIIPP: A new effort to harvest our history
Caroline Arms (Library of Congress)
The preservation of digital content has become a major challenge for society. In December 2000, the United States Congress appropriated funds for a national digital-strategy effort, to be led by the Library of Congress. The Library responded by creating the National Digital Information Infrastructure and Preservation Program (NDIIPP). This paper will provide background information on the project, and outline NDIIPP activities. It will also introduce the goals for one aspect of the program: a group of preservation partnerships, including both Data-PASS and North Carolina Geospatial Data Archiving Project.
UK strategies for digital preservation and digital curation
Chris Rusbridge (Digital Curation Centre, University of Edinburgh)
The UK has long been active in digital preservation. Following a series of international workshops, the Digital Preservation Coalition was set up in 2001. Meanwhile, other parts of the UK were becoming heavily engaged in e-science, and bearing these developments in mind, the Digital Curation Centre was formed in 2004. These two organisations are founded and funded on very different principles and scales, and are looking at related problems from different angles; nevertheless, they aim to complement and mutually support each other. This session will provide some context and background on the two organisations and their approach, leading towards a discussion of relevant issues from an IASSIST perspective.
North Carolina Geospatial Data Archiving Project/NDIIPP: collection and preservation of at-risk digital geospatial data
Steven P. Morris (North Carolina State University Library)
The NCSU Libraries is partnering with the North Carolina Center for Geographic Information and Analysis on a three-year project to collect and preserve at-risk digital geospatial data resources from state and local government agencies. This project is being conducted under a cooperative agreement with the Library of Congress in conjunction with the National Digital Information Infrastructure and Preservation Program. Although the effort will focus solely on North Carolina, it is expected to serve as a demonstration project for other states. Targeted resources include digitized maps, geographic information systems (GIS) data sets, and remote sensing data resources such as digital aerial photography. State and local agencies frequently offer more detailed and up-to-date geospatial data than federal agencies. However, these entities are by definition decentralized, and their dissemination practices focus almost exclusively on providing access to the most current data available, rather than any older versions. The project partners will develop a digital repository architecture for geospatial data through use of open source software tools such as DSpace and emerging metadata standards such as Metadata Encoding and Transmission Standard (METS). In addition, the partners will investigate application of emerging Open Geospatial Consortium specifications for data interoperability in the archive development process. Specific technical and organizational challenges will be discussed.
Data-PASS/NDIIPP: A new effort to harvest our history
Darrell Donakowski (ICPSR, University of Michigan)
In 2004, six major social science data repositories in the United States joined together in a partnership with the Library of Congress to work on ensuring the long-term preservation of their holdings and of materials that they have not yet collected. This presentation will provide an update on this innovative partnership: its current status, what it hopes to achieve, and how organizations like IASSIST can help it succeed. The challenges in identifying "at-risk" data and the process of selecting data for the project will also be presented. We also hope to have a discussion with participants to gather additional ideas of data sources and other organizations that would join our efforts.
2005-05-25: A3: Enlightened Policies: Improving Collections and Acquisitions
Collecting evidence about studies to guide acquisition policy
Janez Stebe (Social Science Data Archive, University of Ljubljana, Slovenia)
The building of a national data archive collection is often pragmatic and thus lacks proper legitimacy. We intend to critically evaluate the situation in the Slovene ADP and present some efforts to improve it. We have begun to collect more systematic evidence on studies that could potentially be selected for further processing. This will hopefully broaden the range of topics covered and ensure that we do not miss the more relevant studies. We will discuss the problems we dealt with in the process: which information about studies to include to guide study selection; how to evaluate whether acquisition criteria such as general data quality, relevance and the new areas the data cover meet users' expectations; and, finally, after locating and selecting potential new studies, how to motivate potential depositors to collaborate in the archiving process.
Setting up acquisition policies for a new data archive
Sami Borg (Finnish Social Science Data Archive, University of Tampere)
Helena Laaksonen (Finnish Social Science Data Archive, University of Tampere)
The size, age and national setting of a data archive markedly affect its acquisition policies. Some archives are happy with every data set they manage to get; others must be selective. We describe the development of FSD's data holdings and acquisition policies, particularly highlighting problems we have come across in practice. We elaborate the issue by answering most of the questions outlined in the session abstract, especially from the point of view of a new archive. A good working strategy for safeguarding a sufficient flow of data requires investment in the general promotion of comparative and secondary research, and a lot of encouragement of researchers to deposit their data in the archive. We discuss methods of acquiring the data sets that are used most frequently, and what types of data have been used the least. We also try to classify the types of action needed to ensure success in getting the data in, while keeping both researchers and data providers pleased. In addition, the presentation discusses briefly how science funding organizations in Europe have acknowledged the need to support the acquisition work of national data archives.
Redesigning and formalising national Data Archives' collection development policies
Amy Pienta (ICPSR, University of Michigan)
Louise Corti (UK Data Archive, University of Essex)
The aim of this paper is to describe the processes of revising, formalising and maintaining a collection development policy for two major national data archives. Both the UK Data Archive and ICPSR have a 35+ year history during which both have seen a number of changes in their acquisitions policies and in the way in which data are acquired. In the first part of the paper, ICPSR's recent phased policy revision in 2004-2005 is described, which has resulted in a new policy approved by eminent social scientists and archivists from around the world. In the second part, the paper sets out the UKDA's work to revise its Collections Development Policy to encompass the needs of the distributed national data service, the Economic and Social Data Service. The UKDA is also now working closely with the national Research Councils to operate official dataset policies, for both Research Programmes and Research Grants, more efficiently. Work from a recent project supporting data creation and data management for a high-investment, cross-disciplinary thematic research programme, one that is generating some interesting yet complex data, is highlighted. Both data archives describe the internal strategies and procedures they have used to review existing policies and identify current best practices, trends and developments, with a view to reworking their own policies. They pay attention to detailing the internal processes, which include setting up new procedures and groups to assess, oversee and monitor potential and incoming national and international data sources.
Identifying quality acquisitions from a data deluge
Zoe Bliss (AHDS History)
This paper will discuss the current acquisitions strategy of AHDS History, one of the five centres of the Arts and Humanities Data Service (AHDS), and outline how it currently identifies, evaluates and accessions data into its collection. In particular it will describe the means employed by AHDS History in conjunction with the AHRB, to ensure that historical digital data created as a result of AHRB funding is offered for deposit with the AHDS. The paper will also discuss the competing influence of funding bodies, creators and users on a robust acquisitions policy.
2005-05-25: B1: Cross-national Social Data: Building Common Ground
NGO and IGO funded surveys: lessons from Vietnam
Daniel Tsang (University of California, Irvine)
Modernization in Asia has created the urgent need for reliable data about socio-economic developments in each country and region. Although statistical agencies, IGOs and NGOs have commissioned numerous surveys about living standards, public opinion and social behavior across the continent, much of the data collected have not been made public, or are only accessible to a select few. Where data files are available, onerous access conditions make accessibility difficult. Even NGO- or IGO-funded surveys are not usually made public or deposited in an accessible data archive. Based in part on field research in Vietnam in 2004 as a Fulbright research scholar, the author, a social science data librarian, surveys the state of social science archiving primarily in Vietnam and the prospects for encouraging and institutionalizing data sharing. This paper addresses various possible solutions to this problem of data sharing and the implications for social science research.
Data archive in developing countries: preservation and dissemination of microdata as an instrument for better development results
Olivier Dupriez (The World Bank, Development Data Group)
Poverty alleviation has become the overarching objective of many development strategies. Consequently, the demand for socio-economic data for designing, targeting, and monitoring development policies and programs has been constantly growing. Considerable investments have been made in developing countries to collect data. But the return on these investments is far from optimal. National data producers often have weak analytical capacity, and access to survey datasets by secondary users is limited due to political, legal and technical constraints. Most datasets are inadequately preserved, documented and disseminated, and remain under-exploited. To foster better use of existing data by promoting the adoption of international standards and best practices in data preservation, documentation and dissemination, the World Bank and the International Household Survey Network are developing guidelines and tools (such as the DDI-based Data Dissemination Toolkit) and are providing technical and financial assistance in data archiving to statistics producers in developing countries. The paper will present the background and rationale of these activities, and describe how the tools being developed and new partnerships with the data archive and research community in developed countries could contribute to better development results.
The vision is that Member States will someday use a common database system for tracking human development indicators. This database system will contain high-quality data with adequate coverage and depth to sustain good governance around the agenda of achieving the MDGs. DevInfo is a database system endorsed by the UN system that can help realize this vision. It is a general purpose database system designed for the collation, dissemination and presentation of human development indicators. The technology has been specifically designed to support governments in MDG monitoring. In addition, the system can be adapted to include additional user-defined indicators linked to national monitoring frameworks. By serving as a common database, DevInfo can be used to add value to national statistics systems by complementing existing databases and bridging data dissemination gaps. DevInfo presents data in tables, graphs and maps to help countries report on the status of the MDGs and to advocate for their achievement through evidence-based policy development.
2005-05-25: B3: Building Data Services: Evidence from the Users
New user needs will change 'best practice' of data archive services
Irena Vipavc Brvar (Slovene Social Science Data Archive)
The Slovene Social Science Data Archive (ADP) is a relatively young archive and was for that reason able to follow the "best practices" of established archives. Nevertheless, the use and usability of the services offered are changing with new technologies, which is why practice that was once very good needs to change as well. Regular users of data archive services have different needs than new users. A decision was made to interview both groups of users to find out, first, how to improve ADP's work and the services it provides (e.g. improving the accessibility of information on the web page) and, second, for what purposes data and documentation are mostly used, so as to identify the most common user group to target. At the same time we provided information about the services that ADP offers and metadata information about the surveys that are available. Our experience is that part of the information provided is rarely used because users have no knowledge of its existence or usefulness. The information gathered will be used to help plan future services and to prepare seminars for target groups.
Jane Fry (Data Centre, MacOdrum Library, Carleton University)
Ernie Boyko (Nesstar)
In an ideal world, a data centre should be able to hire trained professionals at various levels of expertise and seniority and provide them with a career path. The reality is that training for data professionals in universities is limited, people change jobs (and thus leave vacancies), and career paths entirely within the data sphere are limited. This session will describe Canada's national training and mentoring process and will give an example of how students are trained to become part of a team that offers a full range of data services. This is the process by which the expertise required to staff Canada's 67 data centres is being developed.
Building a data archive that meets the needs of both researchers and non-researchers: how CPANDA addresses this challenge
Larry McGill (Cultural Policy and the Arts National Data Archive, Princeton University)
The Cultural Policy and the Arts National Data Archive (CPANDA) at Princeton University was created in 2003 to stimulate the development of a nascent field of study - arts and cultural policy research. As a field-building enterprise, CPANDA seeks not only to archive relevant data sets, but also to spur new research, encourage emerging scholars to take an interest in the field, and inform journalists, policy makers, artists, cultural organizations, and the public about the data it collects. Meeting the needs of such a diverse set of potential users poses significant challenges for organizing and creating web content, building tools for accessing data, and educating non-researchers on the proper interpretation of statistical data. At the same time, CPANDA must also meet the needs of its primary constituency, the research community; it is, after all, a data archive. This presentation will discuss how CPANDA structures users' online experience so as to meet the data and information needs of individuals with widely differing backgrounds and research expertise.
2005-05-25: C1: The Life Course of Survey Data: Evidence from New Tools
Demonstration of a Blaise Instrument Documentation System
Gina-Qian Cheung (Institute for Social Research, University of Michigan)
This presentation will focus on features of the system, which produces DDI-compliant XML-based codebooks and questionnaires that may be printed or viewed as Web pages. Also discussed will be the challenges of parsing Blaise metadata information to document clearly question universes, question variable text, questionnaire skip logic, etc., and to display question text in multiple languages.
Demonstration of the interactive codebook for the National Survey of Family Growth
I-Lin Kuo (ICPSR, University of Michigan)
Documentation of public-use data files frequently differs substantially from the information stored in original CAI instruments in order to address possible confidentiality concerns and to provide researchers with tools suitable for data exploration and analysis. This presentation will explore the role that XML documentation plays in such an environment and how a data producer moves from original questionnaire to final public-use documentation.
This presentation will demonstrate what Blaise as a system provides in terms of instrument documentation (which was very limited until the TADEQ project started), with a focus on XML products. The idea behind Blaise for the last several years has been that Blaise should be open enough to allow users to extract all metadata and transform it into the desired format, as shown by the SRO system.
The U.S. Census Bureau worked with the Computer-Assisted Survey Methods (CSM) program in Berkeley to develop a system for documenting CASES instruments. Examples of the instrument documents can be viewed at: http://sda.berkeley.edu/idoc. The documentation programs parse an instrument into a structured element file, which will be convertible into DDI when a standard for instruments is eventually adopted.
2005-05-25: C3: New Insights in Providing Data Services: A Variety of Evidence
Improving social science data and statistical services through assessment
Joel Herndon (Duke University, Perkins Library)
Alexandra Cooper (Duke University, Social Science Research Institute (SSRI))
Duke University has social scientists housed in (at least) 15 different departments, programs, and schools. It also contains numerous interdisciplinary centers conducting applied research projects based in full or in part on the social science disciplines. Thus, Duke provides social scientists with an environment at once robust and complex, with a wide variety of computing facilities and data service points. In the spring of 2005, Duke's Social Science Research Institute in collaboration with Duke Libraries conducted a joint survey of social science faculty and graduate students to determine (1) the usage of data/statistical services on campus, (2) the level of awareness of data resources/statistical computing on campus and (3) the need for additional data resources. Additionally, the survey attempted to determine the level of, and need for, statistical training on campus. This presentation (and paper) provides a summary of our survey's findings and an analysis of the implications for data and statistical services at Duke and other research institutions.
Data libraries frequently exist within larger institutions, such as universities, government agencies and research institutes. As a consequence, management decisions about planning, funding, staffing and services are made by managers of the parent institution, rather than data librarians. Lack of management understanding about the nature and contributions of data libraries and their staff may result in difficulties in acquiring necessary resources. How can this situation be improved? This paper presents the view from a library manager’s perspective, and outlines an advocacy agenda and tools for increasing the visibility of the data library within the parent institution. Examples are drawn from both the library and advocacy literature and will propose strategies for getting on the administration’s agenda. It addresses the conference theme through demonstrating the value of enlightening decision-makers about the value of data libraries.
Data archiving at the US Central Bank
Linda Powell (Board of Governors of the Federal Reserve System)
As the central bank of the United States of America, the Federal Reserve System consumes vast quantities of economic, financial, and organizational structure data. These data are used for making monetary policy, conducting banking supervision, performing economic research, and implementing consumer protection policies. The focus of this paper is on microdata archived at the Board of Governors of the Federal Reserve System. The paper discusses the types of data used by the central bank, how data are collected and edited, data documentation and metadata, and data purchased from commercial vendors. It also discusses the challenges involved in archiving a diverse pool of data, including communication and coordination, user access across various computer platforms, and meeting the diverse needs of a variety of end users. Finally, the paper discusses some of the solutions to these challenges and how technology is facilitating the growth of data archiving.
2005-05-26: D1: Data Shaping the Neighbourhood: Localised Insight
Scottish Neighbourhood Statistics and the Scottish Index of Multiple Deprivation
Tracey Stead (Office of the Chief Statistician, Scottish Executive)
John Fraser (Office of the Chief Statistician, Scottish Executive)
Robert Williams (Office of the Chief Statistician, Scottish Executive)
Scottish Neighbourhood Statistics (SNS) is the Scottish Executive's on-going programme to improve the availability, consistency and accessibility of small area statistics. SNS has developed a wide range of socio-economic data sets on a new consistent statistical geography called data zones. SNS is being used to inform the Executive's approach to improving the quality of life for people living in Scotland and especially in the most disadvantaged areas. The information is invaluable to Community Planning Partnerships (there are 32 such partnerships across Scotland) where the availability of quality information is crucial to the way in which services are developed and delivered and issues of local concern are addressed. The presentation will cover background, data development, use of data and geography in policy, particularly illustrated by the Scottish Index of Multiple Deprivation (SIMD) 2004.
Barriers and opportunities for remote access to farm business and farm household data
Philip Friend (Economic Research Service, United States Department of Agriculture)
The Agricultural Resource Management Survey (ARMS) is an annual survey of U.S. farm and ranch operators and a primary data resource for a huge array of economic analyses. The Economic Research Service (ERS), collaborating with the National Agricultural Statistics Service, two agencies of the United States federal government, has attempted to respond to increasing demand for access to these data. However, the survey is conducted under a pledge of confidentiality that allows use of the data only for the purpose of statistical analysis. This raises significant barriers to allowing remote access to the ARMS data. Presented with this conflict, ERS sought to use new technologies to enable the Agency to provide easier access to the data while ensuring its confidentiality. The restricted-access extranet application that the ARMS team developed was deployed in September 2004, and a public version of the tool was deployed less than a month later. This paper will describe both the tool and the process by which it was successfully developed.
This paper draws on experiences gained during the production of a spatial dataset characterizing rural England in terms of both socio-economic and environmental features. The base units of analysis are lower level Super Output Areas (SOAs) that are relatively consistent in terms of population size and allow the release of socio-economic data unavailable at smaller output area levels. Other advantages include boundary stability over time and the ability to nest SOAs within key administrative boundaries. SOAs are, however, specifically designed for social census data. Environmental data are more often collected in 1km or 100km grid squares or at points. This paper concerns the challenges arising from the integration of data from the social and natural sciences. Problems of boundary intersections, scale effects, geographic and statistical errors, data holes and the implications of the Modifiable Areal Unit Problem (MAUP) for the resulting dataset are discussed in detail and the importance of metadata for the rural typologies is outlined.
2005-05-26: D2: Enriching Metadata: the Lifecycle Perspective
Survey metadata documentation
Sue Ellen Hansen (Institute for Social Research, University of Michigan)
There are many reasons to capture metadata about the survey life cycle: to facilitate replication, reduce incorrect use of data, facilitate secondary analysis, reduce administrative burden, meet contractual obligations, and ease archiving of survey information and materials. There are an equal number of obstacles to capturing metadata, including time and cost constraints, the complexity of computer-assisted survey systems and instruments, and the lack of adequate tools for documenting the survey life cycle. ISR at the University of Michigan and ZUMA have collaborated on the development of a web-based Survey Metadata Documentation System (SMDS) designed to facilitate documentation of a survey's life cycle, from initial design through data collection, post-survey processing and archiving. This paper will describe the design and structure of SMDS, which has eleven data entry modules. Modules can be completed in any order and by multiple users, allowing the person most knowledgeable about each particular phase of the survey to enter the data. The use of such systems to standardize metadata capture and develop comparative survey documentation (across countries, languages, survey waves, etc.) will be discussed.
Providing context for understanding: the data life cycle
Elizabeth Hamilton (University of New Brunswick)
The identification and capture of products generated over the data life course are critical to documenting the history of a survey. During the stages from identification of a data gap through to data analysis and interpretation, large surveys generate many different products relating to the design, data capture, and processing of the data. Training manuals, for example, provide information on the conduct of an interviewer and, in some cases, that knowledge is critical to the interpretation of the survey results. Reports with recommendations arising from the field testing of survey methodologies and instruments highlight limitations of a survey methodology in more detail than are normally present in the user manual. Using the examples of recent Statistics Canada surveys, such as the National Population Health Survey and the Canadian Community Health Survey, this paper will examine some of the products of the data life course. In archiving, documenting, and using survey data, IASSIST members should be searching out these products to permit a more meaningful understanding of the data.
Fitting the life course of the General Social Survey Cycle 17 in the Data Documentation Initiative
Irene Wong (University of Alberta)
A Canadian Research Data Centre pilot project was conducted to evaluate the use of the Data Documentation Initiative (DDI) standard with the confidential data file of the Canadian General Social Survey, Cycle 17. Among the objectives of this study was to assess how well DDI captured the life cycle of the creation and management of metadata within a major survey, including the initial planning stages all the way through to official announcements of products. This study sought to identify the variety of metadata tools used by Cycle 17's author division within Statistics Canada and to map the relationship between these documentation systems and the elements in the DDI standard. This paper reports on the findings of this pilot project.
The Xtensible Past: XML as a means for easy access to historical research data and a strategy for digital preservation
Annelies G.C.W. van Nispen (Netherlands Institute for Scientific Information Services (NIWI))
Rutger Kramer (Netherlands Institute for Scientific Information Services (NIWI))
This paper reports on the X-past project carried out by the Netherlands Historical Data Archive (NHDA). The main goal of the project is to investigate how the XML data format can improve the durability of, and access to, historical datasets. The assumptions under which XML is considered durable are covered, and the formatting of datasets in XML is described. The X-past project investigated the possibilities for providing access to historical datasets by means of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a protocol whose requests ('verbs') and responses are expressed in XML. Within the framework of the X-past project a prototype information system was developed as a proof of concept, on the basis of which further system requirements have been defined. This paper will present the results of the X-past project and also look forward to its follow-up, Xara.
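As a concrete illustration of the harvesting side, the sketch below issues an OAI-PMH ListRecords request and pulls Dublin Core titles out of the response. The verb and metadataPrefix parameters and the namespace URIs come from the OAI-PMH and Dublin Core specifications; the repository URL is invented, and a real harvester would also need to handle resumption tokens for large result sets.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical repository endpoint; any OAI-PMH data provider works the same way.
BASE_URL = "http://example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Ask the repository for all records in the simple Dublin Core format.
query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
with urlopen(BASE_URL + "?" + query) as response:
    tree = ET.parse(response)

# Each record carries an identifier in its header and a Dublin Core payload.
for record in tree.iterfind(".//oai:record", NS):
    identifier = record.findtext(".//oai:identifier", namespaces=NS)
    title = record.findtext(".//dc:title", namespaces=NS)
    print(identifier, "->", title)
```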
2005-05-26: D3: Tools to Support Data Services: New Approaches
The SDA online analysis system - recent enhancements
Tom Piazza (University of California, Berkeley)
The SDA system (Survey Documentation and Analysis) is used by many data archives to enable researchers to analyze datasets online. Enhancements to the system are being developed on a regular basis, and the current presentation will summarize recent work. The main topics to be covered will be: (a) charts for crosstabulated data and for other output; (b) calculation of confidence intervals for complex samples; and (c) simplified user interfaces, as provided by the Quick Tables facility. The session will also allow users the opportunity to ask questions about development plans and to make suggestions for future development of the system.
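On point (b): SDA's own calculations use the full sample design information, but the basic idea behind a complex-sample confidence interval can be sketched with a design-effect adjustment to the simple-random-sampling variance. A minimal sketch follows; the proportion, sample size and deff value are invented for illustration.

```python
import math

def complex_sample_ci(p, n, deff=1.0, z=1.96):
    """Approximate 95% confidence interval for a proportion, inflating the
    simple-random-sampling variance by a design effect (deff)."""
    se_srs = math.sqrt(p * (1 - p) / n)   # SRS standard error
    se = math.sqrt(deff) * se_srs         # widened for clustering/weighting
    return p - z * se, p + z * se

# E.g. 42% agreement in a clustered survey of 1,500 respondents with deff = 1.8:
low, high = complex_sample_ci(0.42, 1500, deff=1.8)
print(f"95% CI: {low:.3f} to {high:.3f}")
```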
SOEPMENU: A menu-driven Stata/SE interface for accessing the German Socio-Economic Panel
Mathias Sinning (SOEPMENU)
John P. Haisken-DeNew (SOEPMENU)
This paper outlines a panel data retrieval program written for Stata/SE, which allows easier access to the German Socio-Economic Panel data set. Using a drop-down menu system, the researcher selects variables from any and all available years of the panel. The data are automatically retrieved and merged to form a rectangular “wide file”. The wide file is then transposed to form a “long file”, which can be used directly by the Stata panel estimators. The system implements modular data cleaning programs called plugins.
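SOEPMENU itself is written in Stata, but the wide-to-long transposition it performs is easy to illustrate in Python with pandas. The variable names and values below are invented, and only the reshaping step of the retrieval pipeline is shown.

```python
import pandas as pd

# A rectangular "wide file": one row per respondent, one column per panel year.
wide = pd.DataFrame({
    "pid": [1, 2],                    # person identifier
    "income1984": [21000, 18500],
    "income1985": [21800, 19000],
})

# Transpose into a "long file": one row per person-year, the layout
# expected by panel estimators.
long = pd.wide_to_long(wide, stubnames="income", i="pid", j="year")
print(long.sort_index())
```

In Stata the equivalent step would be a single reshape long income, i(pid) j(year) command; the point of the tool is that the researcher never has to write the merge and reshape logic by hand.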
This session will cover methods of sharing information across services that are currently used by several projects. This information principally concerns events, news and other activities both within and external to the UK ESRC Research Methods Programme, the Samples of Anonymised Records, and the Government sub-service of the Economic and Social Data Service. It briefly covers which technologies we use, but focuses on how we use them, the benefits they bring to us and our users, and the systems they work within and contribute to.
2005-05-26: D4: New Insights in Providing Data Services: A Variety of Evidence
Increased accessibility of datasets and statistical resources through faculty-library collaboration
Lynda Duke (Illinois Wesleyan University)
This paper outlines a unique collaborative approach used to bring together library and teaching faculty with library IT staff to better manage datasets and statistical resources. Librarians led a process to identify specific challenges, assess data resources available through the library and academic departments, determine user needs, implement and promote solutions, and assess student and faculty responses. Additionally, the presenter will discuss project outcomes, including organizing and consolidating datasets, designing dataset and statistics web pages to direct users to appropriate sources, naming a librarian the Institutional Representative for ICPSR, and increasing student support through both individual appointments and instructional sessions. In addition, users are more aware of what resources are available, teaching faculty have gained an expanded understanding of the library's mission and the expertise of its faculty and staff, the university has been able to use its financial, time, and material resources more effectively, and librarians have developed a more intimate knowledge of students' needs, the collection, and how to work with datasets.
In 2004 the Bank of Canada implemented a metadata repository to provide economists, analysts, researchers and business data managers with easy access to information about the statistical data they use and produce in their work. The web-based, bilingual (French and English) application allows staff members working with economic and financial data to easily determine what statistical data are available, to acquire information about these data such as their sources, and to locate specific data objects (time series, formulas). The repository facilitates the re-use and sharing of statistical data resources by also capturing information on corporate applications and projects that use and create statistical data. This paper describes the reasons for the creation of the repository, the challenges experienced and overcome, the lessons learned and the rewards since implementation.
e-Government Information: the same old problem -- newly digitized
Alastair J. Allan (University of Sheffield Library)
Government information was, for the last half of the twentieth century, a research resource often overlooked by academics and librarians. With the wholesale migration of government information to digitized formats, the information is now far more readily available and, indeed, far more of it is available. Additionally the globalization of government information provides more detailed data but complicates the choices facing researchers. This paper looks at the type of information that supports e-government and its place in academic research. The difficulties of using the information, though, have not gone away and for some types new challenges have emerged. This summary examines the impact of the web on government information, traces recent developments and predicts future trends.
2005-05-26: E1: Transforming Social Data into Information
Information issues in health networked organisations: cooperative work and new relationships
Christian Bourret (ISIS, Université de Marne la Vallée)
In our knowledge- and information-based society, the rise of networks is becoming a key aspect of healthcare. New networked service organisations have developed as an interface between primary care and hospitals. Their innovative aspects (organisational, human and technical) focus on new cooperative practices centred on patients who are more involved in their health. This paper will first examine the rise of healthcare networks from the French perspective and will compare that with experiences in the United States, Canada, the United Kingdom and Spain, against a background of managing complexity (a global vision of both the networked organisation and the patients). Secondly the paper will outline specific aspects of health data: personal, sensitive, confidential and subject to particular legislation. The paper will then study their use both for the monitoring of patients’ pathways and for the management of organisations, with the essential role of information and communication systems (information sharing and quality of data) and their key element: the patient’s Electronic Health Record. Finally the paper will demonstrate how supportive evaluation of these community service organisations helps develop their collective identity via a process of continuous improvement.
Bridging information and political science: investigating empirical evidence on political information seeking on the internet, 2000-2004
Alice Robbin (Indiana University)
For nearly three decades information technologists and democracy theorists have contended that the digital revolution would renew and invigorate political community. The early utopian cyberdemocrats were optimistic that information and communication technologies (ICTs) would provide a vehicle for effective political participation in the public sphere, by diffusing and improving access to and use of information for public decision making. Employing theoretical frameworks from information science and political science on information and political behavior and democratic theory, this paper examines the claims of the cyberdemocrats as they relate to the search for political information through the Internet. The analysis relies on a series of national surveys of the U.S. population conducted by the Pew Internet & American Life Project between 2000 and 2004. The results indicate that, in general, the Internet has had little effect on increasing the salience of politics for most people but provides an additional channel of information for people who are already active seekers of political information.
Digitising Dutch Censuses, 1795-1971: preliminary results of work in progress
Luuk Schreven (Netherlands Institute for Scientific Information Services (NIWI))
Since 1997 NIWI and Statistics Netherlands have been cooperating in an effort to digitize all Dutch censuses, ranging from 1795 to 1971. The first results were published in 1999. Only when more funding became available (in 2002) did work on this project continue. Last November NIWI and Statistics Netherlands were proud to present a second round of preliminary results during a special meeting at which the new website was presented. Work on the census digitization will continue throughout 2005 and (hopefully) finish at the end of this year. In this paper I would like to address three topics concerning our work on the digitization of census material. First, the new website contains not just data, but also metadata and background material; I will give a quick tour of the new website (www.volkstelling.nl, which by May will also be available in English). Second, work on an online historical GIS is progressing; we hope to have a data mapping server up this year in which to present our census data, and I will give an update on our efforts. Third, in 2005 our work will mainly focus on the documentation of the censuses, and I will address our efforts to document and harmonize the census data.
2005-05-26: E2: Tools for Preservation: Integration and Assessment
Preserving and improving the access to large and complex household surveys
Jostein Ryssevik (Nesstar Limited)
Pascal Heus (World Bank)
Olivier Dupriez (World Bank)
Mark Diggory (Harvard University)
A household survey is an expensive, large and complex project producing a series of inter-related data files and a variety of documents and reports, including questionnaires, sampling plans, data processing notes, table reports, user guides, etc. Preserving, documenting and disseminating the relevant deliverables from a household survey is thus a major task that is seldom planned into the survey project and often neglected. This reduces the value of these important data resources for secondary and comparative analysis. The latest version of the Nesstar Publisher has been designed to meet the requirements of large and complex surveys, allowing data producers and archives to document and preserve the complete set of artefacts from a data gathering project and to publish these collections on CD-ROM or to a Nesstar server. The Nesstar Publisher provides complete support for the DDI as well as Dublin Core and eGMS.
The Virtual Data Center (http://thedata.org), VDC, is a complete open-source (OSS) digital library system for the management, dissemination, exchange, preservation, and citation of virtual collections of quantitative data. VDC functionality includes everything necessary to maintain and disseminate collections of research studies. The system also provides extensive support for distributed collections and federated networks. The long-term goal of the VDC project is to increase the replicability of research by providing advanced tools to support the exchange, citation, and preservation of research data. "The DataWeb" (http://www.theDataWeb.org) is a network of government and non-profit statistical databases with associated data access and manipulation tools. This system, developed by the U.S. Census Bureau with support from other federal agencies, provides free server and analytical software to agencies and organizations that want to participate in this policy-oriented data network. Other planned future releases of the VDC will include, among many other features, extended support for "deep citations" of data subsets, support for extended data analyses, and support for operation of the VDC on other platforms. This briefing will outline a cooperative effort aimed at enabling these systems to interoperate, with a view to giving the user communities associated with each system integrated access to data and resources available within both networks. By having both systems installed on one host, one can provide access to a common data store via either network. In addition, the two systems will locate and exchange data within a federated environment using open standards including DDI, OAI, and open XSL-based metadata crosswalks.
An assessment of Virtual Data Center as a tool for dissemination and digital preservation of social science data
Harrison Dekker (University of California, Berkeley, Doe/Moffit Libraries)
The past year has been marked by a considerable amount of development on the open-source Harvard/MIT Virtual Data Center software project. In this presentation, I hope to examine VDC in an unbiased fashion, explaining what it can do and where the project is headed. Particular emphasis will be placed upon digital preservation theory and practice and the role VDC might play in this regard.
2005-05-26: E3: Enlightening Access Control: New Methods
Issues in federated identity management
Sandy Shaw (EDINA, University of Edinburgh)
Interest in federated identity management schemes has grown considerably in recent years, with developments such as Liberty Alliance, Shibboleth, and WS-Security. The common goal of these schemes is to enable organisations to rely on digital credentials issued by partner organisations even if partners deploy different authentication technologies (such as passwords or digital certificates). When a user requests access to a protected resource, the user is redirected to their home institution where an authentication exchange takes place. If successful, the user is redirected back to the resource along with a set of security assertions signed by the home institution. Within the UK, the focus of interest is the Shibboleth model, developed by the Internet2 Middleware Architecture Committee for Education (MACE). A strategic decision has been made to adopt Shibboleth as the preferred solution for identity management for services in the JISC Information Environment, and for use in other contexts, including e-Science. The immediate issues are the development of trust relationships within and between federations, and to assure interoperability and widespread deployment.
This paper describes the investigation into an alternative and flexible access management system, based on Shibboleth, for three social science data resources. The UK Data Archive (UKDA) is the central registration hub within a dispersed network of eight, UK-based, digital data resource targets. As such, it requires target-to-target communication in order to apply the user’s registration credentials to each resource, thus removing the need for the user to register with each separately or to agree more than once to any special data-related conditions. This paper will describe the preliminary work being undertaken, via the ‘shibbolisation’ of three resources, to place this registration system within a federated model of access control. It will also describe the target-to-target communication required and the suggested Shibboleth-enhanced registration model which could be rolled out to all eight resources. It will identify initial issues discovered by the team and compare the proposed model with other access management schemes.
The Research Data Centre Program: A fundamental element of the social research infrastructure in Canada
Gustave Goldmann (Statistics Canada)
Informed decision making on social issues requires current, comprehensive and very well-targeted research. Societies face two primary challenges in order to respond to this need for timely information – access to relevant data and a corps of qualified researchers to conduct the analyses. As part of a response to the challenges that confront Canadian policy research, a network of Research Data Centres was formally launched in December 2000 with the opening of the centre at McMaster University in Hamilton, Ontario. There are currently 13 Research Data Centres located throughout the country, so researchers are not obliged to travel to Ottawa to access Statistics Canada data. At the same time, the centres are administered in accordance with all the confidentiality rules required under the Statistics Act. The Research Data Centres meet, in a single location, both the need to facilitate access to detailed micro-data for crucial social research and the need to protect the confidentiality and security of Canadians’ information. The research conducted in the centres generates a wide perspective on Canada's social landscape. The network expands the collaboration between Statistics Canada, the Social Sciences and Humanities Research Council, universities and academic researchers, and it builds on the Data Liberation Initiative (http://www.statcan.ca/english/Dli/dli.htm). It also is instrumental in training a new generation of Canadian quantitative social scientists.
2005-05-26: E4: Discovering a Profession: the Accidental Data Librarian
Looking for data directions? Ask a data librarian
Luis Martinez (London School of Economics Data Library)
Stuart Macdonald (Edinburgh University Data Library)
Quantitative data collection in the UK can be traced back to 1086 and the Domesday Survey; however, it is only in recent history that the acquisition, distribution and analysis of quantitative data in digital format has been possible. 1967 saw the establishment of the UK Data Archive at the University of Essex. The mid 1980s saw the emergence of Edinburgh University Data Library and the Oxford Data Library (Nuffield College), and more recently the London School of Economics (LSE) Data Library and the LSE Research Laboratory Data Service. Based at tertiary education institutions, these specialised libraries have evolved independently to assist researchers and teachers in the use of quantitative data for analysis and research purposes. Thus, whether by design or accident, the data librarian was born in the UK. In this digital age of increased IT literacy, technological exposure and expectation, the data librarian’s role is ever more confusing and difficult to identify. This paper will discuss the differing areas of expertise within the UK data libraries and the role of the Data Information Specialists Committee – UK (DISC-UK), in addition to the role played by other information staff at ‘non-data library’ institutions whose work identifies them as potential or ‘accidental’ data librarians.
“You’re a what?”: taking stock of the data profession
Paul H. Bern (Syracuse University)
Now that IASSIST is 31 years old, the time has come to reflect upon what a data profession is. Recent discussions about how one becomes a “data librarian” show that there is no single route to becoming one. But is the data profession just data librarians? Is it even a profession? In this presentation, I plan to investigate what a “profession” is, using some of the many definitions that have been proposed. I will apply these definitions and characteristics to explore whether or not “Data” is a profession. Finally, I will propose some avenues by which we can strengthen the professional qualities of “Data.”
A significant amount of the data used by social scientists emanates from the U.S. Government sector through a number of agencies and programs such as Census, Labor, Education, Social Security, Health and Human Services, and more. Acquiring the additional responsibility for U.S. Government Information subsequent to having data responsibilities brought new perspectives to both areas for this librarian: among them, a greater respect for the diversity of information, particularly data, produced and distributed by the federal government sector; an appreciation for Title 44 of the U.S. Code and its impact on information accessibility; and the challenges and opportunities of converting U.S. Government Information to an electronic, networked environment. Learning more, from the 'documents' side, about the federal government's nature and structure, its new and traditional distribution channels, and the nature of the information available has improved the provision of data services to patrons.
Establishing a data service: The Numeric GeoSpatial Data Service Proposal
Tiffani Conner (University of Connecticut)
Creating a new service within an academic library can be an arduous process even for a seasoned data librarian. For this new librarian, with data services part of the job description and a renowned public opinion center located on campus, investigating, designing, and creating the new service required more than an investigation and a reporting of the findings. This paper includes an environmental scan of Association of Research Libraries ranked institutions within three distinct peer groups, comparisons and evaluations between the subjects, a rationale for the Numeric and GeoSpatial Data Service and its proper placement within the organization, and a four-year plan for the new service. Included in the paper are elements often overlooked in the proposal and planning stages, including training for the new data librarian(s), likely partners from the campus and community, technological tools to investigate, costs of time and equipment, and a vision for the future.
2005-05-26: F1: Timeless Social Data: Past, Present & Future
Industrial classification and the depiction of open source–based production data
Fernando Elichirigoity (University of Illinois)
Cheryl Knott Malone (University of Arizona)
The implementation of the North American Industry Classification System (NAICS) in 1997 signalled a dramatic and influential break with the past by replacing the Standard Industrial Classification (SIC) scheme in use until then to collect information about the American economy. NAICS aims at nothing less than capturing and measuring the information economy. While the change represents an important improvement over the previous system, it also illustrates the difficulty of organizing and aggregating data intended to describe a dynamic economy. To illustrate this issue we look at the ways in which an increasingly significant element of the economy, open source-based software production, is not really depicted or even depictable in NAICS. We argue that the underlying values embedded in NAICS leave no room for novel features of the new economy, such as production practices that interact with traditional forms of economic exchange but are not subject to its rules.
The history of the social survey – the social survey in history
Anne Sofie Fink (Danish Data Archives)
The social survey as we know it today has roots far back in modern history. It dates back to the Enlightenment idea that careful description of natural species, formations of clouds, the population etc. could give an insight that was worth having. The paper will outline the history of the social survey, starting from the studies of the poor in the UK carried out by philanthropists up to the social surveys carried out continuously in the modern welfare state. One aspect of this history is the formation of data archives in a wide range of nation states around the world since the 1960s. The data archives were given the obligation of storing and distributing social survey data sets, as governments acknowledged the great scientific value of social survey data sets. In many respects, to tell the history of the social survey is to tell the history of the development of the modern welfare state; the argument of the paper will be that the two have walked hand in hand. As we debate the challenges, limitations, and responsibilities of the welfare state, so we as data archivists must be aware of the challenges, limitations, and responsibilities of the social survey as a tool for public administration and social science.
Measuring 'the quantum of happiness': ensuring access to the first (and second) Statistical Account
Peter Burnhill (EDINA National Data Centre Edinburgh University Data Library)
Ann Matheson (Hon. Editor, Statistical Accounts -- formerly Keeper of Books, National Library of Scotland)
When setting out to assess 'the quantum of happiness' in the late 18th century, Sir John Sinclair, child of the Scottish Enlightenment and the first Secretary of the (British) Board of Agriculture, was the first to use the term statistics in its modern sense. His survey of 166 queries to each of the church ministers in the 938 parishes resulted in two Statistical Accounts of Scotland. The first covered the 1790s and the second ('New') covered the 1830s; together they represent the best contemporary 'repeat survey' of life at the beginning of the first industrial nation: topics include wealth, class and poverty; climate, agriculture, fishing and wildlife; population, schools, and the moral health of the people. With the formation of the British State underway, the contrast in the presentation of numerical information in text and as tables in the two Accounts serves as a reminder that Sinclair can be credited with the foundations of the 'Blue Book' and of the tradition of 'official statistics' adopted widely today. Just over two hundred years later, in 1996, Henry Heaney (former Librarian of the University of Glasgow) secured support to fund a digitisation plan to protect the relatively rare and fragile volumes of the Statistical Accounts, which had become regarded as a key resource, and agreement that EDINA set up means to access the scanned pages. There followed keying of text, experimentation with cross-sectoral 'ownership', with named-entity extraction from text and use of GIS, as well as a focus on maintaining and developing access to a 'national treasure'. A revision of the user interface to a service accessed by scholars, genealogists and the Scottish Diaspora world-wide will be available as part of the presentation.
2005-05-26: F2: Metadata Enlightenment: Mark-up Standards and Issues
DDI and data
Hans Jørgen Marker (Dansk Data Arkiv)
Essentially the DDI is about documentation; the name says that much. But when you have data documentation, you probably also have some data somewhere, since documentation would not make much sense otherwise. It would usually be possible to place those data in an XML structure. Having gone so far, you might want to have data and codebook in the same XML document. This again is certainly a possibility (and the reference to the CALS tables.dtd in the current DDI shows that it is not a new idea). This paper will present an analysis of some of the issues involved in integrating data with the DDI and propose solutions to some of the problems. Hopefully this will create some discussion, as there are a number of possible solutions and the best solutions will only be found through cooperation.
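As a toy illustration of the idea, the Python fragment below builds a single XML document carrying both codebook-style variable descriptions and the data records themselves. The element names are loosely modelled on the DDI codebook vocabulary (var, labl) but are simplified assumptions; this is a sketch, not a valid DDI instance.

    # Toy illustration of holding codebook and data in one XML document.
    import xml.etree.ElementTree as ET

    codebook = ET.Element("codeBook")

    # Codebook half: variable names and labels.
    data_dscr = ET.SubElement(codebook, "dataDscr")
    for name, label in [("sex", "Sex of respondent"), ("age", "Age in years")]:
        var = ET.SubElement(data_dscr, "var", name=name)
        ET.SubElement(var, "labl").text = label

    # Data half: the records themselves, row by row, in the same document.
    data = ET.SubElement(codebook, "data")
    for row in [("1", "34"), ("2", "57")]:
        rec = ET.SubElement(data, "rec")
        for name, value in zip(("sex", "age"), row):
            ET.SubElement(rec, "cell", var=name).text = value

    print(ET.tostring(codebook, encoding="unicode"))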
The Data Documentation Initiative (DDI) has been welcomed and even embraced by the IASSIST community. For those who sought a way of producing machine-processable codebooks, it is a dream come true. For those who were searching for a data/survey preservation format, it seems to be an answer. But like all standards, it must have tools that support it and expanding user acceptance in order to survive and thrive. Some tools (such as NESSTAR) have been built to embrace DDI and others have been adapted to read DDI files. But is this enough for the DDI to survive, or better yet, thrive? Is there scope for applying these tools and standards beyond the data library/archives community? And what will be the impact of other (competing?) standards such as ISO 11179 and SDMX (Statistical Data and Metadata Exchange)? This paper will explore these issues, attempt to clarify relations and speculate on possible future directions.
Smart qualitative data: methods and community tools for data mark-up
Louise Corti (UKDA, University of Essex)
Elizabeth Bishop (UKDA, University of Essex)
This paper will describe the ESDS Qualidata demonstrator project that forms part of a wider new UK Research Council funded initiative known as the Scheme for Qualitative Data Sharing and Research Archiving (QUADS). The Scheme's aim is to develop and promote innovative methodological approaches to the archiving, sharing, re-use and secondary analysis of qualitative research and data, and that means thinking beyond the traditional centralised data archive model (i.e. ESDS Qualidata). The ESDS Qualidata project is exploring methodological and technical solutions for exposing digital qualitative data to make them fully shareable and exploitable. The project deals with specifying and testing non-proprietary means of storing and marking-up data using universal (XML) standards and technologies, and proposes an XML community standard (schema) that will be applicable to most qualitative data. The second strand investigates optimal requirements for contextualising research data (e.g. interview setting or interviewer characteristics), aiming to develop standards for data documentation and ways of capturing this information. The third strand aims to use natural language processing technology to develop and implement user-friendly tools for semi-automating the processes that prepare qualitative data both for traditional digital archiving and for more adventurous collaborative research and e-science-type exploitation, such as linking multiple data and information sources. The project also aims to further research tools for publishing (e.g. for web interrogation) and archiving enriched marked-up data and associated research materials.
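By way of illustration only, the fragment below marks up a fictitious interview extract with speaker turns and contextual metadata of the kind the second strand addresses. The element names are invented for this sketch and should not be read as the community schema the project proposes.

    # A sketch of non-proprietary XML mark-up for qualitative data.
    # The schema here is hypothetical, not the ESDS Qualidata proposal.
    import xml.etree.ElementTree as ET

    interview = ET.Element("interview", id="int-042")

    # Contextual metadata: interview setting, interviewer characteristics.
    context = ET.SubElement(interview, "context")
    ET.SubElement(context, "setting").text = "respondent's home"
    ET.SubElement(context, "interviewer", sex="F")

    # Speaker turns marked up so the text can be searched and linked by speaker.
    turn = ET.SubElement(interview, "turn", speaker="respondent")
    turn.text = "We moved here just after the war."

    print(ET.tostring(interview, encoding="unicode"))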
2005-05-26: F3: Training for the Use of Data: Evidence from the Trenches
Introducing data history to students
Michelle Edwards (University of Guelph)
How many times have you heard “I’m an English major (or a History major), I don’t need to understand data!”? The use and interpretation of data have not been exploited in today’s curriculum, with the result that many students lack basic know-how when it comes to data use and interpretation. At the University of Guelph, the history and economics departments offer a course entitled “History by Numbers”, with the goal of introducing fourth-year history students to quantitative data sources and to basic statistical concepts. This paper will discuss how students used Nesstar WebView to access the 1871 Canadian Census data for an assignment, how they found navigating through the Census variables relatively easy, and how some were easily frustrated when asked to create cross-tabulations and to interpret their findings. Despite the difficulties encountered by some, by the end of the semester there were several students who “discovered the world of data” and continue to use it to enhance their research papers and theses.
Training subject librarians to provide data services
Katherine McNeill-Harman (Massachusetts Institute of Technology)
Many academic data service librarians work among colleagues who specialize in particular subjects. These subject librarians are experts in their subject areas and maintain close relationships with users in their departments. Given the interdisciplinary nature of data, involving them in providing data services provides an opportunity to leverage their expertise and reach a broader base of potential users. Thus, the MIT Data Services Librarian initiated a project to train subject librarians on data services so that they could: - provide improved data reference service, - refer users to related campus resources, - improve coverage of data in their instruction sessions, and - discuss with their departmental faculty the Libraries' projects regarding data collections and services for data producers. The presenter will discuss the development and implementation of the program, plans for ongoing training, and suggestions for involving subject librarians at other universities.
Librarians who do not specialize in data often seem to find data reference mystifying. Nonspecialists often simply don't understand what data are or how they work. Lacking this basic knowledge, they are also unable to distinguish between different types of data or to know which is most appropriate to a particular situation. We developed a short workshop to explain the data concepts needed to conduct data reference effectively. A major focus was on explaining terminology and discussing the differences between broad types of data: macro and micro, time series, cross-sectional and panel, media opinion polls and academic social surveys. We discussed how to recognize which type of data a patron needs, and gave pointers to major sources for each kind. In our paper we will discuss our ideas on explaining data reference to nonspecialists, describe the workshop we gave, and discuss some of the feedback we received.
2005-05-27: Plenary II
Testing Social Change
John Curtice (Politics and Director of the Social Statistics Lab at Strathclyde University, Co-Director, British General Election Study, Deputy Director ESRC Centre for Research into Elections and Social Trends (CREST))
2005-05-27: G1: Topical Data Collections: Cultural Gems
Upgrading ABC News/Washington Post data collections using DDI and legacy databases
Mark Maynard (Roper Center for Public Opinion Research)
During the past several years the Roper Center has been integrating its online question-level retrieval system (iPOLL) with its catalog of dataset holdings. While critical steps have been successfully implemented on the study and file levels, much more could be done on the variable level to fully realize the research potential of these data resources. As a step toward this end, the Roper Center is working with ABC News and the Washington Post to produce fully documented data files for many polls conducted from 1979 to 1997. The metadata and system file generation project will seek to build upon the iPOLL question-level database by extending it to better reflect elements of the Data Documentation Initiative (DDI). The resulting variable-level metadata can then be used to create DDI-based XML files, SPSS syntax and system files. This presentation will describe the scope and requirements of the project, mapping of iPOLL database fields to DDI variable elements (Section 4), and the user interface for project and metadata management. Finally, an update on progress and a review of the potential for generalized utility of the system and lessons learned will be addressed.
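A minimal sketch of the kind of field-to-element mapping involved follows, assuming hypothetical iPOLL field names; the target elements (var, qstn, catgry, catValu, labl) follow DDI Section 4 in simplified form, so this is an indication of the approach rather than the project's actual mapping.

    # Sketch: mapping question-level database fields to DDI variable elements.
    import xml.etree.ElementTree as ET

    ipoll_record = {  # hypothetical iPOLL row, not the real field names
        "VAR_NAME": "Q1",
        "QSTN_TEXT": "Do you approve or disapprove of ...?",
        "RESP_CATS": {"1": "Approve", "2": "Disapprove", "8": "Don't know"},
    }

    var = ET.Element("var", name=ipoll_record["VAR_NAME"])       # DDI <var>
    ET.SubElement(var, "qstn").text = ipoll_record["QSTN_TEXT"]  # simplified <qstn>
    for code, label in ipoll_record["RESP_CATS"].items():
        catgry = ET.SubElement(var, "catgry")                    # DDI <catgry>
        ET.SubElement(catgry, "catValu").text = code
        ET.SubElement(catgry, "labl").text = label

    # The resulting variable-level metadata could then drive both DDI XML
    # output and generated SPSS syntax, as the abstract describes.
    print(ET.tostring(var, encoding="unicode"))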
Zoltan Lux (The Institute for the History of the 1956 Hungarian Revolution)
The photo-documentary database forms part of the contemporary history database at the 1956 Institute. It began to be compiled in 1997 after the Institute won competitive research and development funding from the Hungarian state. There is a constant need for photographic illustrations for the Institute’s printed and digital publications. Initially, digital photos were simply archived on cd-rom, with attached text files containing descriptions of them. Once a large number of photos connected with post-Second World War Hungarian history had accumulated, it was seen that storage in a database would facilitate repeated use of them. Only the descriptions were included in the database at first, while the digitalized, high-resolution files and associated thumbnails were stored in a separate cd-rom library. The photo descriptions began to be made available on the Internet in 1998. Although the user interface operated in a rather cumbersome way, there was quite a large demand from schools, public institutions and the press, mainly on the occasion of various national commemorations. (http://www.rev.hu/foto/plsql/foto_1a_www_e$.startup) For the Oral History Project presented at the 2004 IASSIST conference, the structure of the database was transformed and integrated into the contemporary history database, so that photo documents would be incorporated in a direct and uniform way into all the Internet projects the Institute is preparing. (http://server2001.rev.hu/oha/index.html) On the structure of the photo archiving and photo database in the presentation, I would like to mention the following issues and problems: 1. Technical issues concerning digitalization of the photo documents (resolution, file format, data storage). 2. Problems relating to collection, archiving and provision of the photo documents and copyright issues. Expansion of the work of collecting photo documents. 3. Description of the photo documents in the light of international standards and recommendations. 4. A short presentation of the database (structure, how photos are linked to other documents, events, persons etc.) Making the database multilingual. Search facilities and other demands. 5. Experience at the Institute with using the database.
ASPECT: a digital library approach to Scottish electoral data
Jane Barton (Centre for Digital Library Research, University of Strathclyde)
Alan Dawson (Centre for Digital Library Research, University of Strathclyde)
Andrew Williamson (Centre for Digital Library Research, University of Strathclyde)
This paper reports on ASPECT (http://gdl.cdlr.strath.ac.uk/aspect/), an initiative which addresses both the public’s need for ready access to empirical data and ephemeral information associated with Scottish parliamentary elections, and the scholarly community’s need for continuing access to and preservation of such data and information. ASPECT has created a digital archive of electoral ephemera, together with detailed election results and statistics, from the 1999 and 2003 Scottish parliamentary elections. In the second phase of development, the digital archive will be extended to incorporate party election broadcasts, hustings, media interviews and news reports. The aim is to support both participative democracy in Scotland and retrospective analysis by the scholarly community; initial usage statistics suggest that ASPECT is already attracting a diverse range of users. The data covers not only the historic election of the first Scottish parliament for 300 years, but also the subsequent election in which the Scottish political context moved from a four-party to a six-party structure. The data provides unparalleled insight into the ways in which politicians’ discourse addresses the need to construct party identity in this more complex context and reflects Scotland’s multiple identities as a distinctive nation within the UK and the EU.
The North American Jewish Data Bank: a rare population archive
Cindy Teixeira (Roper Center for Public Opinion Research, University of Connecticut)
The North American Jewish Data Bank is a social science data archive focusing on historical and contemporary life of Jewish Communities in the United States. During the summer of 2004 the Data Bank moved to the University of Connecticut to better achieve its vision. This presentation will focus on current efforts to realize the three main goals of the Data Bank: 1) to upgrade the Data Bank's datasets to today's processing standards including material organization and metadata storage, 2) to increase the archival holdings of the Databank with both newly acquired datasets and supplemental materials from Roper Center collections, and 3) to stimulate and facilitate research and teaching using the Data Bank. The North American Jewish Data Bank is a collaborative project of United Jewish Communities, the Center for Judaic Studies and Contemporary Jewish Life at the University of Connecticut, and the Roper Center for Public Opinion Research.
2005-05-27: G2: Gaining New Insight from Tables and Aggregate Data: Pivotal News
The FRB and XML: national data and international standards
San Cannon (Division of Research and Statistics, Board of Governors of the Federal Reserve)
Given the speed of technological progress, web users have become savvy surfers who are no longer content to look at static information on a text page. Users of Federal Reserve Board data are no exception and in an effort to address the needs of these more sophisticated users, the Board has undertaken to provide interactive access to the wealth of data it provides to the public. We have developed a web product which will allow both novice and power users easy access to exactly the data they want. From guided queries to batch download capabilities, users are able to specify which series, date range and format they’d like for their download. The data are modeled in XML using the SDMX (Statistical Data and Metadata eXchange) standard developed by a consortium of international agencies. Data modeled in XML using this new standard can be easily served up in a variety of formats, including XML, which will facilitate data exchange with other organizations that use the standard. We have developed a system, which will be demonstrated, that meets the current user needs and yet positions us for Web services and other technologies on the horizon.
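The flavour of the approach can be suggested with a much-reduced sketch of a generic SDMX-style series message. Real SDMX-ML involves namespaces, key families and structure references that are omitted here, and the series key is invented; this is an assumption-laden sketch, not the Board's actual output.

    # Much-simplified sketch of an SDMX-style generic data message.
    import xml.etree.ElementTree as ET

    series = ET.Element("Series")
    key = ET.SubElement(series, "SeriesKey")
    key.append(ET.Element("Value", concept="RATE", value="FEDFUNDS"))  # invented key

    for period, value in [("2005-03", "2.63"), ("2005-04", "2.79")]:
        obs = ET.SubElement(series, "Obs")
        ET.SubElement(obs, "Time").text = period
        ET.SubElement(obs, "ObsValue", value=value)

    # Because the data are modelled once in XML, the same message can be
    # transformed into other delivery formats on request.
    print(ET.tostring(series, encoding="unicode"))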
Bring your tables to the Web
Jostein Ryssevik (NESSTAR Limited)
The presentation will focus on various aspects of the documentation, publishing and presentation of aggregated multidimensional data (cubes) on the Web. Issues to be covered are: a) the principal differences between survey data and aggregated data, b) the DDI extensions for aggregated data compared to other alternatives (like SDMX) and c) elements of the Nesstar solution for aggregated data. Examples will be drawn from a variety of live data services based on aggregated data.
Data management lessons learned from developing GPW v3: implications for users
W. Christopher Lenhardt (CIESIN - Columbia University)
CIESIN and its partners have recently released the most recent version of the Gridded Population of the World, version 3 (GPW v3). This data set is unique in that it integrates traditional social science data (population data derived from national census counts) with administrative and boundary data to produce a new data product with many interdisciplinary applications. Data file inputs number in the thousands and the output data set files and documentation number in the hundreds. The volume of these data presents unique data management and documentation issues above and beyond the usual technical issues related to the methodology behind the production of the integrated data product. This paper will present lessons learned from developing GPW v3, as well as implications of these lessons for end-users of the data. These issues gained increased salience in light of the use of GPW v3, and related data products, in response to the December 2004 Indian Ocean tsunami.
Strengths and weaknesses of the DDI Aggregate Data Extension in directly driving an on-line data visualisation system
Humphrey Southall (University of Portsmouth)
The "Vision of Britain through Time" website, UK National Lottery-funded, presented at IASSIST 2004 and launched last October, makes diverse historical information on Britain's localities available to the general public, but most of its contents are statistical: currently, 10.6m distinct data values. This paper focuses on our implementation of the DDI Aggregate Data Extension, fundamental to the system's internal operation: data are all held in one column of one database table and must be assigned to an nCube to be mapped or graphed. Some extensions reflect the particular application: the definition of themes and rates as additional entities. Others may be of more general applicability: defining universes and measurement units as entities, not attributes. Holding data in Oracle means the data map holds "cell references", not physical locations. Our metadata has to organise data from multiple censuses to chart change over time, making metadata authoring an act of interpretation.
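The storage pattern described can be reduced to a sketch like the following, in which every value sits in one column of one table and a separate map supplies its nCube "cell reference". The table, column and cube names are illustrative assumptions, not the Vision of Britain schema, and SQLite stands in for Oracle.

    # Illustrative reduction of the single-column storage pattern described above.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE data_value (value_id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE data_map   (value_id INTEGER REFERENCES data_value,
                             ncube TEXT,      -- which cube the value belongs to
                             area TEXT,       -- dimension 1: locality
                             category TEXT);  -- dimension 2: e.g. occupation
    """)
    db.execute("INSERT INTO data_value VALUES (1, 1841.0)")
    db.execute("INSERT INTO data_map VALUES (1, 'POP_TOTALS', 'Portsmouth', 'total')")

    # To map or graph, values are pulled out by their cell references:
    row = db.execute("""SELECT v.value FROM data_value v
                        JOIN data_map m USING (value_id)
                        WHERE m.ncube = 'POP_TOTALS'
                          AND m.area = 'Portsmouth'""").fetchone()
    print(row[0])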
2005-05-27: G3: Transforming Data Archives: the Latest Insights
A new data infrastructure for the humanities and social sciences in the Netherlands
Peter Doorn (NIWI-KNAW)
Transforming National Data Services: Australia
Deborah Mitchell (ACSPRI, Australian Social Science Data Archive (ASSDA))
Sophie Holloway (ACSPRI, Australian Social Science Data Archive (ASSDA))
The Australian Social Science Data Archive (ASSDA) faces interesting challenges in acquiring and distributing social science data to a relatively small but, in geographic terms, widely dispersed research community. This presentation will examine how new technologies can assist in transforming an archive that has primarily functioned in a highly centralised manner into a distributed archive that operates nodes in major cities around Australia. The first part of the paper will concentrate on the institutional and management issues that arise from such a change. The second part of the paper will address a range of technical challenges, including the provision of a seamless access interface to holdings for users, the maintenance of consistent archiving practices at each node, and a description of a software program (APS) developed by ASSDA staff to achieve a common standard.
The data infrastructure in Central and Eastern Europe: current situation and prospects
Birgitte Hausstein (Central Archive for Empirical Social Research, University of Cologne)
Ludmila Khakhulina (Independent Institute for Social Policy Russian Social Data Archive)
Larissa Kosova (Independent Institute for Social Policy Russian Social Data Archive)
Janez Stebe (Social Science Data Archive, University of Ljubljana)
It took about two decades to implement the idea of data archives in western Europe. When the first European data archives were founded in the sixties, the social sciences were considered data-poor and the infrastructure for social research was weak. As regards Eastern Europe, on the one hand we are facing similar problems, but on the other hand there are completely new preconditions for setting up a data archive. New technologies and the World Wide Web have opened up undreamed-of possibilities for creating a social data archive. Additionally, the new archives can take advantage of the experience and support of the well-established data archives. This presentation will describe the development of data services for the social sciences in the East European countries, and it will discuss the need to create special networks and expert groups to concentrate efforts on the development of archive tools and metadata production. It will introduce the activities of the East European Data Archive Network (EDAN). Larissa Kosova from the Russian Data Archive (RSDA) and Janez Stebe from the Slovenian Archive (ADP) will report on their experiences and new challenges.
2005-05-27: H1: Becoming Enlightened about Discovering Data: Finding Evidence
Citing statistics and data: where are we today?
Gaetan Drolet (Statistics Canada)
It has been fifteen years since Sue Dodd presented her discussion paper entitled 'Bibliographic References for Computer Files in the Social Sciences' at the 1990 IASSIST conference. Since then very little has transpired to improve the use of citations, even though statistics and data have become more complex with the variety of data formats available. It is time to resume the discussion. This paper explores the challenge for IASSIST and IFDO in developing and promoting a culture of citing statistics and data. Existing practices will be discussed, as well as the link between metadata and citation, i.e. from the Data Documentation Initiative (DDI) to a Data Citation Initiative (DCI). The characteristics of proper citation will be noted, including the benefits of full and consistent citations to users, data producers and data distributors. Lastly, an exciting new web tool for citing statistics and data, available through the Data Liberation Initiative (DLI) in Canada, will be demonstrated.
System of subject headings for Russian Federation budget data information system
Anna Bogomolova (Moscow State University)
Tatyana Yudina (Moscow State University)
Budget data is among the most socially demanded information everywhere in the world. Available at local, regional and federal levels, budget data forms a research base for investigation and a decision-support resource for government. It is also vital for citizens and public initiatives. Given that budget statistics are the most accurately gathered and regularly updated of all state data in Russia, it is a social challenge to implement an information system integrating local, regional and federal statistics with government agency reports, think-tank publications and academic papers on state finances, and to complement it with a well-developed subject-oriented search instrument. An information system on budget data is implemented and updated as part of the University Information System RUSSIA (www.cir.ru / www.budgetrf.ru). The most ambitious part of the product is a System of Subject Headings (SSH) for processing and integrating Russian Federation budget data. The System of Subject Headings will also be used as a search instrument for navigating Russian Federation budget data. As a first step, an ontology of RF state finances was formed. A beta version consisting of 100+ categories was presented for evaluation to a group of specialists in the field; the comments were gathered and discussed, and by collective effort a final version of the SSH was composed. Work has started on the terminological presentation, description and support of each category. Terminology is borrowed from the UIS RUSSIA thesaurus (70,000+ descriptors and terms). As a next step an SSH thesaurus (120 categories, 5,000+ descriptors with terms) will be created and tested in processing budget data and documents. Specialists engaged in the project will evaluate the results so that the tool can be refined. The instrument created will be used to search across an integrated holding of RF budget data and documents for investigating the financial situation and for systematic analysis of economic and social processes.
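In outline, the retrieval idea resembles the toy sketch below, where each subject heading is backed by thesaurus descriptors and documents are assigned headings by descriptor matching. The headings and terms shown are invented examples, not categories from the actual SSH.

    # Toy sketch of subject-heading assignment backed by a thesaurus.
    # Headings and descriptors are invented for illustration.
    subject_headings = {
        "Federal budget revenues": {"tax revenue", "customs duties", "VAT"},
        "Inter-budgetary transfers": {"federal transfers", "regional subsidies"},
    }

    def headings_for(document_terms: set) -> list:
        """Assign a document every heading whose descriptors it mentions."""
        return [heading for heading, descriptors in subject_headings.items()
                if descriptors & document_terms]

    # A document mentioning VAT and regional subsidies gets both headings:
    print(headings_for({"VAT", "regional subsidies"}))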
Rob Procter (National Centre for e-Social Science)
Within the past five years, the Grid has evolved from its beginnings as a highly specialised research tool to its adoption as the blueprint for a new kind of global computing infrastructure. This has seen Grid computing being taken up by a wider research community and the emergence of new forms of research practice now encapsulated within the notion of ‘e-Science’. Grids come in a variety of forms: computational grids, data grids, access grids and sensor grids. This paper examines the possible forms and potential impact of sensor grids for social science research. The value of sensor grids stems from the way they enable researchers to manage remote data gathering via distributed networks of instruments. Examples of sensor grids in scientific research include networks of environmental monitoring devices. Analogously, the notion of sensor grids for social science research envisages harnessing the progressive ‘instrumentation’ of the social world through, for example, CCTV, mobile phones and ubiquitous computing devices: digital data about social phenomena are being generated on an ever-increasing scale as a by-product of the everyday activities of social actors and could provide a richer picture of social phenomena than is available through more conventional data-gathering techniques. The paper will discuss possible research applications for social science sensor grids, including the real-time analysis of social patterns and processes, and issues for their practical realisation, including data integration and management, and ethical issues relating to access, security, confidentiality and privacy.
Kenneth Miller (UK Data Archive, University of Essex)
One of the casualties of the UKDA moving from a Unix system supporting INGRES to a Microsoft environment using SQL Server was the BIRON search system. It was decided to replace it with a simple Google-type search interface based on Site Server indexes. With Microsoft no longer supporting Site Server, and given the new requirements of the ESDS specialist services and internal demands for more powerful tools, the UKDA decided to review its resource discovery options. This paper discusses the relative merits and resource implications of using SQL full-text indexes, Z39.50, OAI and web harvesters. The presentation will include demonstrations of the more powerful and advanced searches now available.
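As an indication of the first option, the sketch below runs a ranked full-text query over a small catalogue. SQLite's FTS5 module stands in here for SQL Server full-text indexing (and assumes an SQLite build compiled with FTS5), so the syntax is analogous rather than identical to what the UKDA would use.

    # Sketch of catalogue search over a full-text index (SQLite FTS5 as a
    # stand-in for SQL Server full-text indexes).
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE catalogue USING fts5(title, abstract)")
    db.execute("INSERT INTO catalogue VALUES (?, ?)",
               ("British Social Attitudes, 2003",
                "attitudes to welfare and tax"))

    # A ranked full-text query, rather than a simple LIKE scan:
    for (title,) in db.execute(
            "SELECT title FROM catalogue WHERE catalogue MATCH 'welfare' "
            "ORDER BY rank"):
        print(title)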
2005-05-27: H2: Shaping Metadata Insight: The Metadater Tool
Metadater: data models and tools for documenting comparative research data
Ekkehard Mochmann (GESIS- ZA Cologne)
Uwe Jensen (GESIS- ZA Cologne)
The MetaDater project is developing a metadata management and production system for comparative social surveys repeated over space and time; all other survey designs are in fact reductions or simplifications of that model. The overarching objectives are to develop standards to describe, and tools to produce and manage, the related metadata. The scope of information the project has to deal with is metadata according to the general definition of the Data Documentation Initiative (DDI): "Metadata (data about data) constitute the information that enables the effective, efficient, and accurate use of those datasets". All developments are based on an analysis of all phases of a study life cycle, from which the conceptual and relational data models for MetaDater were developed. As agreed, the results of the model development and user analyses were provided to the DDI Alliance, which will take up the life-cycle model. Currently the style sheets and functionalities for data documentation by data collectors and providers are being developed; first user tests will start in June 2005. MetaDater is designed to improve the interoperability of social science databases for comparative research and will thus contribute to the emerging social science data GRID.
The data model and data production procedures and dissemination
Marios Fridakis (Greek Social Data Bank at EKKE)
John Kallas (Greek Social Data Bank at EKKE)
The conceptual metadata model must meet two main requirements: a) it must support the data-storage needs of the metadata management system and b) if the data model is implemented in a relational database, it must be usable by other applications. Thus, the conceptual metadata model will support the functionality of the whole research product life cycle. If we try to model the whole research procedure we need a conceptual model of the whole system. We must differentiate between the conceptual model of the system and the conceptual metadata model, because the term 'conceptual' is misleading: the metadata management system will automate specific parts of the conceptual model of the system. We distinguish five categories of metadata entities. The first category consists of the entities that describe autonomous documents produced in the context of a study at study-level documentation (i.e. documents that are produced in the context of a specific study but can also be used independently of it); we call these entities documentation objects. The second category consists of the entities that describe autonomous objects produced in the context of a study at variable-level documentation. The third category consists of the entities that describe objects that depend on another object and cannot be disseminated autonomously. The fourth category consists of the entities that describe administrative characteristics of specific documentation objects, or of the entire system. The fifth category consists of the entities that describe objects that are produced and used in the context of a metadata management system rather than in the context of a study: the study description, the dataset description, the questionnaire, the project description, the file descriptions and the references to publications. The entities in this category have some common administrative characteristics, which are expressed in the conceptual data model as 'subtypes' of a 'supertype' table called "Documentation objects". Some of these functions will be presented in detail.
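The supertype/subtype arrangement can be pictured with the reduced sketch below, in which common administrative characteristics live in a "Documentation objects" supertype table and each object kind is a subtype keyed to it. The column names are illustrative assumptions, not the MetaDater schema.

    # Reduced illustration of the supertype/subtype pattern described above.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE documentation_object (          -- supertype: shared
        obj_id INTEGER PRIMARY KEY,              -- administrative fields
        created TEXT, creator TEXT, version TEXT);
    CREATE TABLE study_description (             -- one subtype
        obj_id INTEGER PRIMARY KEY REFERENCES documentation_object,
        title TEXT);
    CREATE TABLE questionnaire (                 -- another subtype
        obj_id INTEGER PRIMARY KEY REFERENCES documentation_object,
        language TEXT);
    """)
    db.execute("INSERT INTO documentation_object "
               "VALUES (1, '2005-05-27', 'EKKE', '1.0')")
    db.execute("INSERT INTO study_description VALUES (1, 'Greek Social Survey')")

    # Each subtype inherits its administrative characteristics by joining
    # back to the supertype:
    print(db.execute("""SELECT s.title, d.created
                        FROM study_description s
                        JOIN documentation_object d USING (obj_id)""").fetchone())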
MetaDater's perspective on cross-national and diachronic data
Reto Hadorn (SIDOS)
For a long time, data documentation standards have been limited to the description of cross-sectional studies. To some extent this documentation schema can also be used for cross-national studies, as far as the latter take the form of a well-known integrated dataset. The increasing importance of some cross-national projects over the last twenty years or so (Eurobarometer, ISSP, ESS etc.) nevertheless raises new expectations. The DDI Alliance has created a special expert group to handle these questions, and the CSDI (Comparative Survey Design and Implementation Network) works on quality standards for cross-national studies, of which detailed documentation is an important component. For the EU-funded MetaDater project, working on a metadata management and production system, the repeated cross-national study was selected as the highest complexity to be handled in the data model and the application. The following questions must be answered in this perspective: Is the ambition to describe the well-known integrated data file with more appropriate categories, or to document each of the country-specific datasets? If one chooses to document the single components of the cross-national study, a new question arises: how will the whole set of single cross-sections be described? Defining the whole as a 'collection' of cross-sections would do; a more promising way would be to relate organically the questions and variables in those datasets which are related over space or through time. If this network of relationships between questions (and variables) is established carefully, it can be used to support the process of integrating country-specific datasets. At the end of the process, the questions and variables in the integrated dataset are related to the questions and variables in the underlying national cross-sections. This information can be fed into any standard publication of the metadata for the whole project, making it possible to navigate through the whole project.
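The 'network of relationships' could be pictured as a correspondence table such as the hypothetical sketch below, which relates an integrated variable to its country-specific sources. All variable and country names are invented for illustration.

    # Sketch of a correspondence table linking an integrated variable to the
    # country-specific variables that feed it. All names are invented.
    variable_links = [
        # (integrated variable, country, source variable)
        ("EDU_YEARS", "CH", "f_educ_dur"),
        ("EDU_YEARS", "DE", "bildung_jahre"),
    ]

    def sources_of(integrated_var: str) -> dict:
        """All country-specific variables feeding one integrated variable."""
        return {country: source for var, country, source in variable_links
                if var == integrated_var}

    print(sources_of("EDU_YEARS"))  # {'CH': 'f_educ_dur', 'DE': 'bildung_jahre'}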
The MetaDater data model and the formation of a grid for the support of social research
John Kallas (Greek Social Data Bank at EKKE)
The formation of a GRID for the support of social research depends strongly on the collaboration of data producers, data providers and analysts. The formation of the GRID will support: a) building a subject-matter ontology for every specific research field, b) secondary data production, c) data discovery, d) the design of new studies and e) supplementary research documentation. To support the functionality of the GRID a social science infrastructure is needed, and since the raw material of any social scientific production is data, a conceptual data model is the heart of this infrastructure. Existing information systems used for primary production, data preservation and dissemination cannot support the functionality of a GRID, because the data models used in the three working phases communicate poorly with one another. This presentation will show, for each of the elements listed above, how the integrated approach of the MetaDater metadata model supports the functionality of the GRID.
Canadian statistics: evidence for enlightened democracy
Alan Bulley (Statistics Canada)
A distinguishing feature of the Scottish Enlightenment was its emphasis on participatory democracy. Believing that political power belonged to its citizens, Scotland promoted literacy to enable all Scots to debate and decide the issues of the day. An ocean away and more than two centuries later, Statistics Canada informs and educates Canadian citizens to help them take part fully in the life of their nation. To this end, the organization has created a range of public good products to engage Canadians—not only researchers, but students, journalists and business people—with reliable and timely information about their country. In elaborating on these themes, this paper will look at Statistics Canada’s ongoing redesign of its public good products, its use of usability testing and its active engagement of the business, educational and media communities to encourage statistical literacy and research. The paper will also examine some of the challenges facing public good statistics, such as meeting the needs of diverse publics and archiving dynamic publications.
Academic researchers and their use of digital data preserved in the U.S. National Archives and Records Administration
Margaret O. Adams (NARA)
The U.S. National Archives and Records Administration (NARA) offers access to records to anyone who seeks it, subject to the terms of the federal Freedom of Information Act. Throughout the 35 years of NARA's custodial program for electronic records, academic researchers have acquired copies of more archival digital data files than any other researcher group. New knowledge, new research techniques and new understanding of historical phenomena have emerged from researcher use of NARA's archival electronic records. Some research has influenced social policy, some has answered questions about the nature of war casualties, while other research has informed our common understanding of ourselves and our nation. International scholars have used U.S. archival digital data to study the opinions of their countrymen on a wide variety of issues posed to them by an agency of the U.S. government, collected by that agency for its own programmatic purposes. The proposed paper analyzes research use of archival data and is based upon administrative data from the most recent years of an archival reference program for electronic records at the U.S. National Archives and Records Administration. Both traditional and online forms of access to digital data will be discussed and the impact of offering multiple forms of access will be explored.
Economic data and publications as snapshots in time
Katrina Stierholz (Federal Reserve Bank of St. Louis)
The Federal Reserve Bank of St. Louis has initiated the FRASER (Federal Reserve Archival System for Economic Research) project, and is also adding a new historical component to FRED, called ALFRED. FRASER is an image archive of historic economic data publications which allows for sophisticated retrieval of statistical tables over many years. ALFRED is an automated system that will allow researchers to retrieve historical real-time economic data series. ALFRED is slated to go live in mid-2005; initially, histories will be available for about 25 data series, with more to be added in the future. These two new tools will allow researchers to access economic data as it was presented at a moment in time, and with each correction and update. For many researchers it is important to analyze data as it was available, rather than the perfected data released much later. Important policy decisions are made with this imperfect data, and determining the problems of that data as well as its usefulness is key to developing good models. For the casual user, it can provide information about economic conditions at a specific place and time. A significant feature of the FRED archive retrieval system is the option to enter both a date range (earliest and latest desired material) and a calendar "as-of" date. The presentation will discuss these two tools, their development, and their use. The presentation will also discuss the problems and issues of the naïve user, who may copy tables that have been OCRed even though the OCR has not been verified. We have decided to impose a burden on our users: we require that they register and sign an agreement stating that they recognize the limitations of the OCR process. In particular, uncorrected OCR tables make it difficult to spot errors, and correcting the OCR is expensive. We plan to evaluate the use of the material and correct the OCR where use warrants it.
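The "as-of" idea can be sketched as follows: each observation carries both the period it describes and its publication date, so a query can reconstruct what was known at a given moment. The figures and revision dates below are invented for illustration, not actual ALFRED data.

    # Sketch of "as-of" (real-time vintage) retrieval. Values are invented.
    from datetime import date

    vintages = [  # (period described, value, date published)
        ("2004-Q4", 3.8, date(2005, 1, 28)),   # advance estimate
        ("2004-Q4", 3.1, date(2005, 2, 25)),   # later revision
    ]

    def as_of(period: str, known_on: date) -> float:
        """Latest value for `period` that had been published by `known_on`."""
        published = [(pub, val) for per, val, pub in vintages
                     if per == period and pub <= known_on]
        return max(published)[1]  # most recent publication wins

    print(as_of("2004-Q4", date(2005, 2, 1)))   # 3.8: what policy-makers saw
    print(as_of("2004-Q4", date(2005, 3, 1)))   # 3.1: after the revision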
Large-scale, cross-sectional government datasets: research published and recent developments
Jo Wathan (Centre for Census and Survey Research, University of Manchester)
Vanessa Higgins (Centre for Census and Survey Research, University of Manchester)
The United Kingdom is fortunate in having a plethora of microdata datasets available for use by the academic community. Major cross-sectional datasets collected on behalf of central government are routinely made available for secondary analysis via the UK Data Archive at the University of Essex, supported by the government surveys group of the Economic and Social Data Service (ESDS Government) led by the Cathie Marsh Centre for Census and Survey Research at the University of Manchester. The data include a wide range of specialist surveys collected principally for reasons of policy development and monitoring. The surveys are continuous, large-scale and cross-sectional. The portfolio includes surveys such as the General Household Survey, Labour Force Survey, British Crime Survey, Health Survey for England, Expenditure and Food Survey, Family Resources Survey, British Social Attitudes Survey and national variants. Many of these surveys have been running for decades, providing considerable scope for assessing social change. This paper reviews the way in which these datasets are currently being used by the UK academic community and highlights the research potential offered by these important and highly flexible resources. The paper will also explain the way in which recent and future developments in dissemination, and in value-added materials for users, facilitate increased use of the data. Further information about surveys supported by ESDS Government can be found online at http://www.esds.ac.uk/government