2000-06-07: A0: Innovations in Large Scale Data Dissemination Systems
Digital Library Support for Sharing and Replication: A Report on the Virtual Data Center
Micah Altman (Harvard University)
Quantitative social science data are tools for research. The analysis of such data appears in professional journals, in scholarly books, and, more and more often, in popular media. For the scholar, the connection between text and data is natural. We analyze data and publish results. We read the results of others' analyses, learn from them, and move forward with our own research. But these connections are sometimes difficult to make. Data that back up an article are often difficult to find and even more difficult to analyze. Thus, our ability to replicate the work of others and to build on it is diminished. We sometimes chase down the author of an article to find the data; and often the data are not there to be found. A similar problem exists for scholars who want to move from data to the published work based on those data. It may not be easy to trace the publications that emerge from a data set, so that we can build on, rather than duplicate, what has come before. Scholarship would be greatly enhanced if one could move easily from data to text and from text to data. The Virtual Data Center Project is an operational, open-source digital library that enables the sharing of quantitative research data and the development of distributed virtual collections of data and documentation. The Virtual Data Center is being developed cooperatively by the Harvard University Library and the Harvard-MIT Data Center, with support from a research grant primarily from the National Science Foundation. In this paper, we discuss the prototype of the VDC and plans for its first public release. These releases extend the current system operating at HMDC in a number of ways. We show how the public release generalizes the software infrastructure and interfaces of the prototypes to enable the linking together of multiple (distributed) collections of social science data. We outline the major features of the alpha software and the results from analyzing use of our system.
The Data Web and FERRETT: Innovations in Integrating Distributed Federal, State, and Local Data
Cavan Capps (U.S. Bureau of the Census)
The U.S. Bureau of the Census and the Centers for Disease Control and Prevention are engaged in a collaborative and innovative program to develop the Data Web, which provides access across the Internet to demographic, economic, environmental, health, and other databases housed in different systems in different agencies and organizations. The Data Web is a collection of systems and software that provide data query and extract capabilities, as well as data analysis and visualization tools. This project is motivated by the requirements of scientists, researchers, academicians, businesses, and professionals who increasingly need real-time access to government and scientific data originating from diverse systems and disciplines. In the past, data have been system-specific. Data sets have been inextricably bound to the hardware and software systems that house them, and to the agencies that administer them. Access to data has depended upon access to the machines or the software systems that house the data. Recent technological developments make it possible to break this bind and allow for wider access to data, regardless of differences in underlying hardware and software, and regardless of organizational boundaries. Extending FERRETT and developing the Data Web enable and empower data producers at all levels of government and in non-profit organizations to take part in publishing, documenting and using the data they produce in an integrated statistical environment. The development goals are to improve usability, promote repurposing, and support statistical literacy. By building a federal link to state and local governments, data users gain real-time access to data for policy analysis and decision making. The distributed system is based upon the principles of 1) providing users what they need to know about the data throughout the process of identification, manipulation and understanding, and 2) providing powerful tabulation and analytical capabilities for knowledgeable repurposing of data.
FASTER: Accelerated Access to Official and Statistical Data
Simon Musgrave (Data Archive, University of Essex)
Jostein Ryssevik (Norwegian Social Science Data Services)
The main aim of this current EC-funded project is to accelerate access to official and other statistical data by creating tools that are both easy to use and flexible. It will allow users to create their own personal data environment, derive all relevant contextual and supporting information, and hence make the most productive use of data resources. This project builds on the work of NESSTAR and the DDI. It is focused on developing metadata models that are applicable from the production of data through to analysis, and will draw on the work of several initiatives in this area. The core objectives are to: 1. Develop an open architecture for the dissemination and use of statistics. The metadata specification will build on the earlier work, and expert workshops will be organised to discuss appropriate semantics and structures for different types of statistical data. These will include time-series data, aggregate data, and all types of survey data. The syntax will be in XML. An important element will be access control issues. 2. Develop a configurable user environment in which users are able to personalise their environment for immediate interaction (including visualisation and geographical displays) with aggregate and micro statistical data and other resources. 3. In order for 2 to become a reality, two major issues will need to be resolved: data confidentiality (including disclosure control) and the proper authentication of users. The project will be co-ordinated by the Data Archive, which will work in collaboration with four other European partners: the Norwegian Social Science Data Services, the Danish Data Archive, Statistics Netherlands and the University of Milan.
Steven Ruggles (Minnesota Population Center, University of Minnesota)
Catherine Fitch (Minnesota Population Center, University of Minnesota)
This paper describes a new NSF infrastructure project to create and disseminate an integrated international census database composed of high-precision, high-density samples of individuals and households. Our project has two components. First, we propose to collect data that will support broad-based investigations into the most important scientific questions facing social and behavioral science. Second, we will create a web-based data dissemination system that will incorporate innovative capabilities for worldwide access to both metadata and microdata. Large machine-readable census microdata samples exist for many countries around the world, but access to these data is highly limited and the documentation is often inadequate. Even where such microdata are available for scholarly research, comparisons across countries or time periods are difficult because of inconsistencies in both data and documentation. This project will provide basic infrastructure for the social sciences by making the samples publicly available, converting them to a consistent format, supplying comprehensive documentation, and by developing new web-based tools for disseminating the microdata and documentation over the Internet. The Internet is transforming the nature of electronic data dissemination. At the same time, the proliferation of fast personal computers and UNIX workstations has slashed the cost of large-scale data analysis. This project capitalizes on both of these developments by creating a population database of unprecedented size and power, and by providing tools to make it readily available for analysis on desktop machines. The project builds on our experience with the Integrated Public Use Microdata Series (IPUMS). The IPUMS is a coherent series of individual-level U.S. census data drawn from thirteen census years between 1850 and 1990. By putting all the census samples in a compatible format with consistent variable codes and integrating their documentation, the IPUMS greatly simplifies the use of multiple census years. Just as important, we have developed methods of electronic dissemination that have democratized access to these resources (http://www.ipums.umn.edu). The original IPUMS project includes 22 samples drawn from one country, the United States. It contains 65 million records totaling 25 gigabytes when uncompressed. Although this is one of the world's largest public-use databases, it is modest by comparison with our new endeavor: we plan to build a database with some 650 samples drawn from 21 countries on six continents, and it will include about 550 million records requiring some 250 gigabytes in uncompressed form. We will need to write the equivalent of about 14,000 pages of documentation, compared with 3,000 pages in the current IPUMS. This increase in scale will necessitate a proportionate increase in complexity, so we will develop new navigation and extraction tools to keep access to the data and documentation simple. Our task is not merely to convert hundreds of additional samples into IPUMS format. Because of international variation in census concepts such as "group quarters," and cultural concepts such as race and marital status, we will need to design the database from the ground up. This design process will be undertaken in close collaboration with international and domestic microdata experts. 
The basic design goals, however, remain the same as in the original IPUMS: we will create a system that simplifies use of the data and at the same time loses no meaningful information except when necessary to protect respondent confidentiality. The project will incorporate domestic and international data from a variety of sources. We will start with the U.S. census samples for the period 1850 through 1990 in the current IPUMS. Then we will add additional domestic samples to allow detailed study of the U.S. population in the late twentieth and early twenty-first centuries. Specifically, we will incorporate 528 monthly samples of the Current Population Survey (CPS) for the period 1964 through 2008, the 2000 Census Public Use Microdata Sample (PUMS), and the American Community Surveys (ACS) for the period 2000 through 2008. With these additions, the database will have a much stronger contemporary focus than the current IPUMS, and will be especially useful for national and local studies addressing policy questions. The international component of the database falls into two categories. For some countries, we will incorporate public-use census or survey samples that already exist, just as we have done for the United States. These data are generally well-documented, but they will pose complexities we have not previously encountered because of national differences in census concepts, cultural practices, and language. For other countries, no public-use census files presently exist. In these instances, we will create new anonymized samples drawn from surviving census tapes that were used to construct census tabulations for publication. In collaboration with the statistical offices of the countries concerned, we will explore new techniques to ensure full respondent confidentiality while maximizing detail. These data files are often poorly documented, and we will require extensive assistance from the statistical offices and experts of each country to ensure that we interpret them correctly. The development of metadata is central to the project and poses even greater challenges than the manipulation of the microdata. For every census and country we aim to provide comprehensive documentation at or exceeding the standards of the U.S. Census Bureau. The metadata will not be confined to codebooks and census questionnaires. As in the case of the existing IPUMS, we will provide a wide variety of ancillary information to aid in the interpretation of the data, including full detail on sample designs and sampling errors, procedural histories of each dataset, full documentation of error correction and other post-enumeration processing, and analyses of data quality. Both the data and the documentation will be distributed through an integrated data access system on the Internet. Users will extract customized subsets of both data and documentation tailored to their particular research questions. This will not, however, simply be a data extraction system. Rather, it will be a set of tools for navigating documentation, defining datasets, constructing customized variables, and adding contextual information. The most difficult task will be to provide a system whereby users can easily gauge the comparability of a particular variable in any sample to its counterpart variable in any other sample. Given the large number of samples, this level of documentation would be so unwieldy as to be virtually unusable in printed form. 
Accordingly, we will develop software that will construct electronic documentation customized for the needs of each user.
2000-06-07: B0: The Politics of Census 2000 -- Implications for Data Quality
Progress and Problems of Preserving and Providing Access to Qualitative Data for Social Research: the international picture of an emerging culture
Kenneth Prewitt (U.S. Bureau of the Census)
The presentation discusses how the politics of census results became the politics of census methods, with emphasis on the ways in which Census 2000 controversies may affect data quality and data products. There will be an overview of the political considerations that affect which questions are included in a decennial census, the factors that affect response rates, correction for the differential undercount, and possible data problems related to item non-response.
2000-06-07: C1: International Infrastructures for Statistical Data
EUSTAGE: Towards a European Statistical Agency as an Intermediate Data Facility in Europe
Wouter De Groot (Netherlands Organization for Scientific Research)
Structure in The Netherlands: In 1994 the Scientific Statistical Agency (WSA) was founded by the Netherlands Organization for Scientific Research (NWO). Its main purpose is to open up important microdata for scientific research in the social sciences. The Scientific Statistical Agency acts as an intermediary between producers and providers of data on the one side and researchers on the other. WSA is part of the Social Science Research Council and is therefore close to research trends. In addition, researchers are contacted to establish data relevance. The agency's board takes care of policy and strategic matters. WSA has a limited staff, because it focuses on its intermediary role and does not carry out any archiving. The data producer itself can be the provider. If that is not the case, WSA will call in the Data Archives of the Netherlands Institute for Scientific Information Services (NIWI) to take care of archiving and distributing the data. Opening up data for secondary use means improving the availability and accessibility of microdata. In order to make data available, WSA enters into long-term contracts with data producers settling financial and privacy matters. The agency may grant subsidies to update documentation or to make data suitable for secondary use. Occasionally a lump sum is paid to data providers; in this case a minor part of the data costs is passed on to the users. For this purpose WSA has developed a tariff system based on the number of cases and variables. To ensure privacy protection several measures have been taken. Firstly, microdata are available for scientific research only; it is not allowed to use them for administrative or marketing purposes. Furthermore, identifying variables of respondents are removed from the data files, and data are delivered to organizations only, not to individuals. Before receiving data for the first time, organizations are screened. If the organization is granted permission, a contract is drawn up between the data producer and the research organisation. This contract states, among other things, that the data may not be merged with other data and that research output is registered and checked for identifying tables or data before publication. Finally, every individual researcher has to sign a secrecy statement. Currently the WSA data collection consists of surveys on persons and households by Statistics Netherlands (CBS), school careers at primary and secondary schools, labour market developments of graduates, labour supply and demand panels, the European Values Studies, and the CentERdata Telepanel. In addition, a grant has been awarded to CEREM, the Centre for Research on Economic Microdata. This is an on-site facility at CBS where microdata on firms have been made available to researchers. Another goal of WSA is to improve the accessibility of microdata. Secondary users may not be familiar with the ins and outs of the data, which may hamper their use. By presenting an overview of the data available and lowering the threshold for their actual use, WSA tries to meet researchers' demands. This is done partly through the publication of the WSA-Catalogue. This catalogue gives information about the data WSA has made available, the procedures involved in obtaining them, and the tariffs at which they can be purchased. Two versions exist: one on paper and one on the Internet. The accessibility of microdata is also improved by the transfer of knowledge.
Firstly, through data documentation, publishing articles in the Research Council's Newsletter, and providing information on the WSA website. Secondly, through user meetings and workshops. At user meetings data producers and users can exchange experiences regarding the microdata. Sometimes general methodological matters such as weighting or microsimulation are discussed. The result is an improvement of the microdata concerned and valuable information for WSA about clients' wishes and research trends. Internationalisation and an interdisciplinary approach to solving complex research and policy problems are important research trends. As a consequence there will be an increased demand for data that are made comparable between nations, over time, and at multiple levels (micro and macro). Another trend in data collecting is the combined use of register and survey data. This may complicate dissemination of the data for secondary use. Furthermore, the Internet offers new opportunities not only to present information about the data but also to provide on-line data access. To anticipate these developments the agency uses a thematic rather than a producer-oriented approach to open up data and construct data clusters (collections of coherent data). Another development will be the documentation of Dutch data in English and the collection of international data. WSA's approach could act as a blueprint for other intermediary data organizations. Towards a European structure: As information becomes available more easily and in vast quantities, the need for an intermediary organization becomes greater. At the national level several initiatives have been taken; during the presentation several cases will be presented. To stimulate the use of secondary data in other countries and to initiate internationally comparative and interdisciplinary research, a European Statistical Agency (Eustage) should be established. Eustage could also act as a "data source infrastructure" by bringing together, documenting and making available existing data sources, constructing user-friendly uniform European datasets for scientific analyses, and co-ordinating data collection for surveys on a European scale. The European Statistical Agency can facilitate the supply of secondary data that are indispensable to economic and social science research. Eustage can be supported by a limited staff, as it will be connected through ICT network facilities to the research institutes and statistical bureaus of the European countries.
Web-based Data Enters Classrooms in Countries in Transition
Dusan Soltes (Comenius University, Bratislava, Slovakia)
One of the areas where the gap between the most developed countries and the countries in transition in Central and Eastern Europe used to be most evident is computerization, and in particular the use of computers for education. In the past, the main reason was the lack of access to the most advanced computer technologies, which for political reasons (embargo) were not available to the former socialist countries. Currently, the main problem is a lack of funding for acquiring modern information and communication technology, especially for the needs of seriously underfunded education. This does not mean that there has not been an ever-growing number of schools with access to the Internet, the web, e-mail, etc. But this whole process has been hampered, to a large extent, by the lack of funding and also of capacity for developing specialized, education-oriented web-based applications and systems, in comparison with the most advanced countries. One initiative to help solve this problem is CMIS, the Computerised Monitoring Information System on the Rights for Education (including those belonging to minorities), a joint project funded by the UNICEF European Office in Geneva and the Slovak Committee for UNICEF under the worldwide programme "Education for Development". CMIS has been developed as a tool for monitoring the practical implementation of the rights for education according to the United Nations Convention on the Rights of the Child. In addition to this main objective, the system has a further objective: to support the development and refinement of methods and techniques of education through the direct use of a more comprehensive web-based computer application. The main structure of CMIS, with its four basic modules, corresponds to these objectives. The main modules are as follows: Legislative database - contains the list of all legislative acts which in the Slovak Republic define and secure the basic rights for education; this database also gives access to other websites with the full texts of the particular legislative acts and contains references to the main sources of the relevant international legislation. Statistical database - contains all relevant statistical data on education in the Slovak Republic since its origin in 1993 within the mandate of the Convention, i.e. regarding the education of children up to the age of 18, including basic information on the various types of pre-school institutions, basic schools, high schools, teachers, etc.; this database contains not only "classical" statistical data but also uses technical tools for their graphical and geographic representation. International comparisons - this module focuses on information that enables international comparisons of national programmes implementing the rights for education under the UN Convention. Development resources - in line with its development objective, this module contains direct access interfaces to various other worldwide education web sites, either of the United Nations (e.g. Voice of Youth, Education for Development, Education for All) or of other providers and operators of education web sites.
The whole system has already been implemented as a pilot testing version and is accessible worldwide, with all the main features of a modern web-based system: unlimited and easy access, user friendliness, non-stop operation, full-text search, a hot line, user comfort, openness, flexibility and expandability. In addition to its basic Slovak version it also contains a partial English version. Beyond the main objectives expressed in the CMIS modules, the system also serves various other educational objectives that are important in the current globalized and interconnected world. It enables all users, including pupils and students, to make direct, active use of modern information and communication technologies through a web-based system; to learn, in a modern form, about some of their basic rights regarding their main activity, education; to access and work with statistical data; and to learn about their peers in other countries, enter into direct dialogue with them, and thus overcome some still-existing isolation and stereotypes and develop their own sense of international cooperation against xenophobia. Furthermore, the system enables them to familiarize themselves with, and actively access, various worldwide education programmes, and to use and improve their practical knowledge of English as the language of the web and of current globalization, thus preparing young people for the challenges of the contemporary world.
Developing Statistical Infrastructure in other countries: The Statistics Canada Experience
Ernie Boyko (Statistics Canada)
The ability to produce reliable official statistics is an important tool for a country to possess as it makes decisions about its social and economic infrastructure and programs. Many international programs are only open to countries that meet certain standards with respect to their national statistics program. Like any other technical infrastructure, statistical infrastructure requires investment, nurturing and specific expertise. Countries like Canada, which have well developed and respected statistical systems, are often in a position to share their knowledge and expertise with other countries. This presentation will provide an overview of Canada's statistical development work in other countries including China, Hungary, the Ukraine, Eritrea and Cuba.
International Training Activities in the U.S. Census Bureau
Robert D. Bush (U.S. Census Bureau)
Since 1947, over 12,000 participants have been trained in either the long-term or short-term training programs conducted by the U.S. Census Bureau, both in Washington, DC and abroad. Graduates of our programs work in over 120 countries to improve the collection, processing, and dissemination of statistical information. Many of these graduates have risen to positions of leadership, not only in their own countries, but also in international agencies. Current training is focused on short-term courses on the practical aspects of census and survey operations. Six such workshops are included in the current training cycle, which spans the period from May to October 2000.
2000-06-07: C2: Technical and Legal Perspectives on Digital Archives
Muddying the Waters: When the Laws Dictate Archival Decisions
Thomas E. Brown (NARA)
Public archives responsible for public records in all media are public institutions whose activities are governed by statutes and laws. Frequently, laws are enacted with unintended impacts upon the archives. This paper will review several laws within the United States and then explain their unintended legislative consequences for the U.S. National Archives regarding electronic and other types of records. These consequences have included what records should be kept, where they should be kept, when they should be destroyed, and who can have access to them. The paper will conclude with some generalizations about the laws that dictate archival decisions.
Collection-Based Persistent Long Term Preservation
Robert Chadduck (NARA)
A summary of empirical results will be presented from National Archives and Records Administration sponsored research investigating the application of supercomputer-based persistent object technologies to support preservation and sustained access to ultra-high volume electronic records collections.
Authenticity as a Requirement of Preserving Digital Data and Records
Eun G. Park (UCLA)
Shelby Sanett (UCLA)
Assuring continued authenticity is an essential preservation consideration for digital data and records. What is authentic data? Which intellectual and technical elements of data and records are essential for assuring authenticity and how should these be maintained and represented over time? How are the authentic data and records used in various systems of practice? Doctoral researchers on the US team of the InterPARES Project (International Research on Permanent Authentic Records in Electronic Systems) will address these questions in light of findings to date of ongoing case studies and interviews being conducted with government agencies, academic institutions, and various organizations in America and Canada. The proposed paper will emphasize findings as they relate to the specific characteristics and function of authenticity in the preservation of digital data and records.
2000-06-07: C3: Building Connections among Heterogeneous Terminologies and Multiple Languages
Seamless Searching of Textual and Numeric Resources
Fred Gey (University of California, Berkeley)
The dream for the use of new technology in libraries is to support seamless searching across an increasing range of resources on a growing digital landscape. The reality is that network-accessible digital resources, like the contents of a well-stocked reference library, are quite heterogeneous, especially in the variety of indexing, classification, categorization, and other forms of "metadata." The contribution of this project, funded by a 1999 National Leadership Grant from the Institute for Museum and Library Services, is to demonstrate improved access to written material and numerical data on the same topic when searching two quite different kinds of database: text databases (books, articles, and their bibliographic records) and numerical data (socio-economic databases). More: https://web.archive.org/web/20000915082853/http://www.src.uchicago.edu/datalib/ia2000/prog/gey.txt
Metadata in International Database Systems and the United Nations Common Database (UNCDB)
Robert Mayo (United Nations Statistics Division)
Current developments in Internet technologies and their global reach provide national and international government statisticians with many new opportunities to provide statistical data and metadata to users. This paper provides an overview of current practices in providing metadata on sources and methods in international data systems and focuses on systems designed to provide data via the Internet. The paper discusses the essential elements of a metadata system in an international statistical system. The metadata system of the United Nations Economic and Social Information System Common Database will be discussed in detail.
Limbering Up: Preparing for the Race to Provide Multi-lingual Access and Automatic Indexing
Ken Miller (University of Essex)
LIMBER (Language Independent Metadata Browsing of European Resources) is a European Union, Human Language Technologies funded project led by the Central Laboratory of the Research Councils (CLRC) at Rutherford Appleton Laboratory (RAL). The other partners in the project are The Data Archive at the University of Essex (UKDA), the Norwegian Social Science Data Services (NSD) and Intrasoft. This paper will outline the aims of the project, the technologies, architecture and standards to be utilised, the products it will deliver, and the progress made to date. Its major objectives are to aid resource discovery, interoperability and mapping between terminologies. Its main deliverables are a) to develop and demonstrate a multilingual query and retrieval tool working in several languages and using a multilingual thesaurus, with keyword and phrase translation, and b) to develop and demonstrate tools to support the construction and maintenance of databases using automatic indexing.
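The following is a minimal, hypothetical sketch (in Python) of the kind of thesaurus-backed, language-independent keyword lookup the abstract describes: terms in any supported language resolve to a language-neutral concept identifier, which can then be used for retrieval or translated for display. The concept IDs and term lists are invented for illustration and do not reflect the actual LIMBER thesaurus.

```python
# Hypothetical multilingual thesaurus lookup: every term maps to a
# language-neutral concept ID; queries in one language can be translated
# into another via the shared concept.
THESAURUS = {
    "C042": {"en": ["unemployment"], "de": ["arbeitslosigkeit"], "no": ["arbeidsledighet"]},
    "C107": {"en": ["housing"], "de": ["wohnen"], "no": ["bolig"]},
}

def find_concept(term):
    """Return the concept ID whose term list (in any language) contains `term`."""
    t = term.lower()
    for concept_id, labels in THESAURUS.items():
        if any(t in terms for terms in labels.values()):
            return concept_id
    return None

def translate(term, target_lang):
    """Translate a query keyword into the target language via its concept."""
    concept = find_concept(term)
    if concept is None:
        return []
    return THESAURUS[concept].get(target_lang, [])

if __name__ == "__main__":
    print(find_concept("Arbeitslosigkeit"))   # -> C042
    print(translate("unemployment", "no"))    # -> ['arbeidsledighet']
```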
From Many, One: The Archival Infrastructure for U.S. Social Science Data Research
Margaret O. Adams (NARA)
Cindy Severt (University of Wisconsin)
Ilona Einowski (UC Data, Berkeley)
Janet Vavra (ICPSR)
As data professionals consider the role of data in the emerging infrastructure of "digital libraries," it may be useful to examine the archival institutions in the U.S. that collectively already form the organizational infrastructure for social science data research. This examination provides an opportunity to investigate the unique missions of each of the archival institutions represented, and to consider them as complements to each other. Who is responsible for what? Who serves which researchers? What kinds of social science data does each preserve? What is the relationship between these institutions and the emerging virtual digital libraries? Participants are Janet Vavra, Inter-university Consortium for Political and Social Research; Ilona Einowski, UC Data Archive, focusing on the archiving of state administrative welfare data; Cindy Severt, University of Wisconsin's Data Program Library Service, focusing on the kinds of uniquely-held archival collections at some of the academic data libraries/archives; and Margaret O. Adams, Reference Services Program Manager, Center for Electronic Records.
The ARL GIS Literacy Project: Support for Government Data Services in the Digital Library
Mary French (University of Missouri-Columbia)
This presentation will describe the ARL GIS Literacy Project and its role in providing support for continued access to government data which is increasingly distributed only in digital form. In particular, it will address the University of Missouri's (MU) experience in the broad context of the ARL GIS Literacy Project goals as well as in comparison to the reported experiences of other participating institutions. It will discuss what MU has produced in terms of GIS services, what has been learned about broadening awareness of GIS, and what impact participation in this project may have had on social science research conducted at MU. The MU experience will be examined as an example of the creation of support mechanisms for integration of GIS into the digital library environment and into interdisciplinary research.
The Role of GIS in the University
John C. Hudson (Northwestern University)
Developing and Using Finding Aids for Geospatial Data
Steven Morris (North Carolina State University)
Researchers are now presented with a bewildering variety of options when trying to select geospatial data resources. At the North Carolina State University Libraries, for example, 20 different sources of U.S. census tract boundary data alone are available for use. Data resources are selected according to the requirements of the individual user and project. Factors to consider in the selection process include: data accuracy, scale (or level of detail), file format, file size, currency, concurrency within the dataset, coordinate system and datum, use restrictions, availability of metadata and lineage information, tiling scheme, and availability of feature attributes. Geospatial data users need access to finding aids that allow one to: a) determine what data resources are available, and b) select the appropriate data for use. Types of geospatial finding aids include: * Keyword search (geospatial metadata index or standard library catalog) * Browse search (including access to browseable metadata) * Clickable or draggable map interfaces (spatial metaphor) * Thesaurus-based access (feature- or attribute-based lookup) * Gazetteer lookup (place-based lookup) * Interactive mapping (pre-acquisition data evaluation or "scratch 'n sniff") * Data resource guides (discussion of resources) * Data specialists (data librarians or archivists) At the NCSU Libraries a range of data discovery tools are offered in support of unmediated, time- and location-independent access to and use of extensive geospatial data resources made available on the campus network. While, to date, only a subset of the geospatial data holdings have been added to the library catalog, an extensive Web-based thesaurus lookup system facilitates access to networked, offline and Web-based resources. Web-based mapping systems are used to provide broader access to data, cultivate awareness of geospatial data resources, generate interest in the underlying data, and facilitate pre-acquisition data evaluation. Interactive Web-based map indexes are also used to facilitate map-based and gazetteer-based lookup of resources. Extensive Web-based documentation of data resources augments the personalized assistance of the GIS data librarian. This presentation will provide an overview of geospatial data selection and finding aid issues. The experience of the NCSU Libraries in the development and use of these tools will be discussed. Special focus will be given to geospatial data resources of importance to the social sciences data community.
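A toy sketch of two of the finding-aid styles listed in the abstract: gazetteer (place-name) lookup and map-based selection by bounding box. The catalog records, coordinates, and place names are invented for illustration and are not NCSU's actual holdings or tools.

```python
# Hypothetical gazetteer-plus-bounding-box finding aid: resolve a place name
# to a bounding box, then select catalog records whose extents intersect it.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    title: str
    fmt: str
    bbox: tuple  # (west, south, east, north) in decimal degrees

CATALOG = [
    DatasetRecord("Census tract boundaries", "shapefile", (-84.5, 33.8, -75.4, 36.6)),
    DatasetRecord("Hydrography", "coverage", (-80.0, 34.0, -77.0, 36.0)),
]

GAZETTEER = {  # place name -> bounding box
    "Wake County, NC": (-78.99, 35.52, -78.25, 36.08),
}

def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def find_by_place(place):
    """Gazetteer lookup: resolve a place name, then select intersecting datasets."""
    box = GAZETTEER[place]
    return [rec for rec in CATALOG if intersects(rec.bbox, box)]

if __name__ == "__main__":
    for rec in find_by_place("Wake County, NC"):
        print(rec.title, rec.fmt)
```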
2000-06-07: D3: Effective Use of Technology in Delivering Data
Systems Implementation of the Integrated Public Use Microdata Series: A Description of the Computing Infrastructure of the IPUMS
William Block (University of Minnesota)
This presentation will describe the computing infrastructure of the Integrated Public Use Microdata Series (IPUMS). The IPUMS is a coherent series of individual-level U.S. census data drawn from 13 federal censuses between 1850 and 1990 and is used by social science researchers worldwide (for more information on IPUMS, see www.ipums.umn.edu). The computers that currently serve IPUMS to the world are a mix of Sun and Intel architectures running various Unix operating systems, Fortran and Perl programs, shell scripts, and HTML. In spite of its apparent complexity, however, IPUMS development has been responsive to user input, packaged into a format suitable for installation at mirror sites, and successful at maximizing its computing capability at minimal cost and software-licensing overhead. We will begin by describing the IPUMS Data Extraction System, which has been available on the World Wide Web for over three years and undergoes continuous development. User suggestions have led to a number of improvements in the system, such as the ability for users to retrieve and modify old extract requests. The amount of online documentation available to users of the IPUMS Data Extraction System has also greatly increased. This paper will describe these and other recent software enhancements to the IPUMS Data Extraction System. Next we will introduce the software and features of IPUMS that comprise its ability to function in a compute cluster mode. The cluster design grew out of a need for greater computing power with minimal resources and has been in successful operation for over a year. Additionally, after IPUMS started receiving inquiries about establishing mirror sites of its Data Extraction System, we worked to make IPUMS portable and installable across multiple platforms. The IPUMS job scheduling feature will also be explained, as well as the process by which IPUMS is packaged for distribution and installation at remote sites. Last, we will introduce the latest advances in IPUMS computing and outline plans for future development of the system. This includes demonstrating a full-fledged stand-alone installation of IPUMS on a laptop that boots Linux and runs a virtual Windows operating system as a task. While originally designed for demonstrating IPUMS at academic conferences, some researchers would undoubtedly find such a setup useful. Future development plans for IPUMS include plans to re-implement the IPUMS Data Extraction System using Java and the incorporation of DTD- and XML-compliant metadata coding schemes throughout the IPUMS.
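As a rough illustration of what a census microdata extract engine does at its core, the sketch below reads fixed-width records and keeps only the variables a user selected. The variable layout and sample records are hypothetical, not the real IPUMS record layout or extraction code.

```python
# Hypothetical fixed-width extract: column positions come from a codebook-style
# layout table; only the requested variables are returned for each record.
import io

LAYOUT = {          # variable name -> (start, end) as 0-based slice positions
    "YEAR":  (0, 4),
    "AGE":   (4, 7),
    "SEX":   (7, 8),
    "MARST": (8, 9),
}

def extract(lines, variables):
    """Yield dicts holding the requested variables from fixed-width records."""
    for line in lines:
        yield {v: line[slice(*LAYOUT[v])].strip() for v in variables}

if __name__ == "__main__":
    sample = io.StringIO("1880 02311\n1880 45122\n")
    for row in extract(sample, ["YEAR", "AGE", "SEX"]):
        print(row)
```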
The E-Codebook Data Extraction Web Interface
Ron Nakao (Stanford University)
This presentation will outline the conceptualization, development, and implementation of the E-Codebook Data Extraction Web Interface at Stanford University. The web application will be demonstrated. Finally, lessons learned and future plans will be shared. The E-Codebook Data Extraction Web Interface brings together in a 'one-stop shop' the information needed by data users to create extract files, in a quick and easy-to-use web interface. This easy access to data allows users to focus their time and attention on their statistical analyses, and allows faculty to incorporate student selection of 'real-life' data in their instruction. Codebook information, such as variable names, labels, descriptions, values, value labels, and sample frequencies, is parsed into an Oracle relational database. Users 'shop' for variables via keyword queries to our codebook database and add any that they desire to their variable 'basket' for eventual extraction. Links to important online resources, such as codebooks, technical documentation, and other web sites, are also provided. Extract files can be downloaded by the user to their desktop PC for further analysis, or accessed directly via their Unix account on Stanford's university-wide Leland Systems.
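The sketch below illustrates the "codebook in a relational database" idea using SQLite in place of Oracle; the table layout, sample rows, and query are invented for illustration and are not Stanford's actual schema.

```python
# Hypothetical codebook table with a keyword "shopping" query, standing in for
# the Oracle-backed variable search described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE codebook (
        varname TEXT, label TEXT, description TEXT, study TEXT
    )""")
conn.executemany(
    "INSERT INTO codebook VALUES (?, ?, ?, ?)",
    [
        ("educ",   "Education",     "Highest grade completed",         "GSS 1998"),
        ("incfam", "Family income", "Total family income, categories", "GSS 1998"),
    ],
)

def shop_for_variables(keyword):
    """Keyword query against the codebook, as a user 'shopping' for variables."""
    cur = conn.execute(
        "SELECT varname, label FROM codebook "
        "WHERE label LIKE ? OR description LIKE ?",
        (f"%{keyword}%", f"%{keyword}%"),
    )
    return cur.fetchall()

basket = shop_for_variables("income")   # matches go into the user's 'basket'
print(basket)                           # [('incfam', 'Family income')]
```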
Providing Student Workspaces for Data Analysis on the Web
Tom Piazza (University of California, Berkeley)
Students and researchers like the convenience of being able to do basic data analysis on the Web, without having to download a copy of the data. However, they often want to create their own recodes and computed variables for datasets contained in the data archive. At Berkeley we are currently testing a setup that uses the SDA system to establish three levels of access and control over a dataset: Level 1: Archive copy of the original dataset; never changed. Level 2: Recodes created by the archive or by a professor for a class; students and researchers can use these variables but cannot delete them or create others in this category. Users have access both to the Archive dataset containing the original variables and to this area containing "official" recodes. Level 3: Private workspace set up for each student in a class or for specific researchers; they can create and delete new variables that are available only to themselves; the workspace is password protected. Users can combine variables from all three levels in a single table or analysis. The presentation will explain how this works.
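A minimal sketch of how the three-level lookup described above might resolve a variable name: the private workspace is checked first, then the "official" class recodes, then the read-only archive copy. The variable names and resolution order are illustrative assumptions, not SDA's actual implementation.

```python
# Hypothetical three-level variable resolution: workspace (level 3) overrides
# class recodes (level 2), which sit on top of the archive dataset (level 1).
ARCHIVE   = {"age": "archive variable AGE", "income": "archive variable INCOME"}
RECODES   = {"age4": "instructor recode of AGE into 4 groups"}
WORKSPACE = {"alice": {"my_income_log": "alice's computed variable"}}

def resolve(user, varname):
    """Look a variable up across the three access levels, most local first."""
    for level in (WORKSPACE.get(user, {}), RECODES, ARCHIVE):
        if varname in level:
            return level[varname]
    raise KeyError(f"{varname!r} not found at any level")

print(resolve("alice", "my_income_log"))  # private workspace (level 3)
print(resolve("alice", "age4"))           # class recodes (level 2)
print(resolve("alice", "age"))            # archive dataset (level 1)
```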
2000-06-08: E0: Panel
Research Centres and Confidential Data
Paul Bernard (University of Montreal)
Michael Clune (University of California, Berkeley)
Ron Dekker (Netherlands Organization for Scientific Research)
Research Data Centers offer one means of providing researchers with access to confidential microdata from surveys and administrative records. Increasingly, government agencies are relying on research data centers because of growing concerns about data security and the confidentiality of respondents. Building a data center rather than releasing data publicly allows a data supplier to maintain control over data uses and safeguard the confidentiality of respondents while providing researchers access to very detailed data. Data agencies have found that the research data center model is particularly useful for inherently sensitive data, such as firm and establishment records, health records, and files with detailed geographic identifiers. In most cases, the data available through research data centers are not suitable for public release. In other cases, agencies may employ a tiered data release strategy whereby a public use file with limited detail is released and a more detailed file is available at the data center. As pressures to restrict the amount of detail available in public use files grow, data centers are expected to increasingly rely upon this strategy. This session surveys the use of data centers in four countries: Canada, the Netherlands, Norway, and the United States. Panelists will provide an overview of the types of data available at RDCs and the procedures for obtaining access. In particular, panelists will highlight aspects of the research process which are unique to data centers, including the project selection process, security measures, legal safeguards, access fees, special project requirements, and limitations on research output.
2000-06-08: F1: Building on the DDI/DTD Foundation: Additional Perspectives on DTD Application and Use
The Development of a Generalized Resource Tool for Aggregate Data (GRETA) at the University of Minnesota
Wendy Treadwell (University of Minnesota)
This presentation will describe the progress of the Public Data Access System (PDAS) under development by the Machine Readable Data Center (MRDC) and Social Science Research Facility (SSRF) at the University of Minnesota. In its final form the PDAS will be an innovative web tool for accessing large sets of aggregate data by pulling both table structure and cell content information from a DDI-compliant XML-tagged codebook. The successful completion of this project will put the MRDC in a position to expand the amount of on-line data available to researchers, and will position MRDC to provide specialized data from the 2000 Census. The pilot project to be described at IASSIST uses Minnesota and limited U.S.-level data from the 1990 Census Summary Tape File 4. This information includes population, economic, educational and housing data for 49 racial and ethnic groups, covering standard Census geographies. The information provides detail not found in the commonly available data from the 1990 Census and has not been mounted on the web by other institutions. This file is of particular interest to researchers who need to analyze the condition of specific racial groups in small geographic areas. It has satisfied numerous requests for data from the MRDC and we anticipate wider use when web access is provided. The MRDC serves two major constituencies through its dual role as the social sciences data library in the University Libraries and as the Tape Depository custom-service provider for the Minnesota State Data Center. The SSRF has a background in providing research support for social sciences data users as well as technical support for on-line collections such as the Integrated Public Use Microdata Series (IPUMS) created by the Historical Census Project at the University of Minnesota.
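The sketch below shows the general idea of driving a tabulation interface from an XML-tagged codebook: the table structure (dimensions, categories) and cell identifiers are read from markup rather than hard-coded. The element and attribute names are simplified placeholders, not the exact DDI tag set or the PDAS codebook format.

```python
# Hypothetical aggregate-data codebook in simplified XML: dimensions define the
# table structure, and cells map dimension categories to data-file cell IDs.
import xml.etree.ElementTree as ET

CODEBOOK = """
<codebook>
  <table name="P012" title="Sex by Age">
    <dimension name="sex" categories="Male,Female"/>
    <dimension name="age" categories="Under 18,18-64,65+"/>
    <cell id="P012001" sex="Male" age="Under 18"/>
    <cell id="P012002" sex="Male" age="18-64"/>
  </table>
</codebook>
"""

root = ET.fromstring(CODEBOOK)
for table in root.iter("table"):
    dims = {d.get("name"): d.get("categories").split(",") for d in table.iter("dimension")}
    print(table.get("title"), dims)
    for cell in table.iter("cell"):
        print(" ", cell.get("id"), {k: v for k, v in cell.attrib.items() if k != "id"})
```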
Marcus Schommler (InformationsZentrum Sozialwissenschaften)
The ISSP (International Social Survey Programme, http://www.issp.org/) is a continuing annual programme of cross-national collaboration on surveys covering topics important for social science research. Every year a common reference questionnaire on a specific topic forms the basis of the surveys in what are now more than 30 participating countries. The ZA (Central Archive for Empirical Social Science Research at the University of Cologne, http://www.za.uni-koeln.de/) has been the official archive of the ISSP since 1986. The ZA is responsible for processing, merging and archiving all national and cross-national data sets and for distributing data and documentation (including codebooks) of merged ISSP data sets to the scientific community of social scientists. To carry out this task it is necessary to map the questionnaires of each country onto the reference questionnaire. Up until now this has been done by controlling and editing SPSS setup files with a simple text editor. This method is very time consuming and error-prone. The goal of the information system "ISSP DataWizard" is to assist the ZA in doing the mapping in a more effective way. One of the central problems for the information system is the restricted screen space. A huge range of information is relevant for the mapping of the questionnaire, but only a selection should be displayed on the screen so as not to overload it with data. One solution to this problem is adaptivity. The information system offers data (in lists, comboboxes etc.) depending on the specific context the user is dealing with. Information hiding prevents the user from seeing information that is not relevant in this particular context. The rule-based system aims to assist the user in deciding which parts of the specific country questionnaire can be mapped onto the reference questionnaire. The rules for mapping variables and values (the structure of the survey) are implemented in the system. The user can add and edit further rules that operate on the questionnaire data. These rules check the plausibility and validity of the submitted data or derive new data from it. The information system is based on the WOB model (a tool-metaphor-based, strictly object-oriented, graphical, direct-manipulation user interface; Krause 1995), which was developed at the IZ (Social Science Information Centre Bonn). Some of the main features of this model, such as dynamic adaptation and context-sensitive permeance, are used as software-ergonomic principles. The system was developed using Java/Swing and has been used against ISSP data at ZA for testing purposes since November 1999. Beginning with the ISSP 2000 survey it is planned that the program will be used under production conditions. For the year 2000 survey the ISSP DataWizard will be used exclusively by ZA in Cologne. After testing is completed successfully, the software will be distributed to the other ISSP project partners. In this context XML/DDI is being considered as playing a key role as the future interchange format between ISSP members. As of March 2000, the subset of XML/DDI relevant to ISSP data is already supported for importing and exporting.
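As a small illustration of the kind of mapping and plausibility rules described above, the sketch below recodes a national variable's codes into the reference questionnaire's scheme and flags values that fall outside the agreed code set. The variable name, codes, and rule are invented and do not reflect the actual DataWizard rule syntax.

```python
# Hypothetical mapping rule: recode national category codes into the reference
# questionnaire's codes and report values with no valid counterpart.
REFERENCE_CODES = {"V205": {1, 2, 3, 4, 8, 9}}   # agreed codes, incl. missing codes

def map_national(values, recode, reference_var):
    """Recode national codes into the reference scheme and collect problems."""
    mapped, problems = [], []
    for v in values:
        r = recode.get(v)
        if r is None or r not in REFERENCE_CODES[reference_var]:
            problems.append(v)
        mapped.append(r)
    return mapped, problems

# national code 5 has no counterpart in the reference questionnaire
national_values = [1, 2, 5, 4]
recode = {1: 1, 2: 2, 3: 3, 4: 4}
print(map_national(national_values, recode, "V205"))
# -> ([1, 2, None, 4], [5])
```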
Where will DDI-documented Data Come From?
Tom Piazza (University of California, Berkeley)
The only people currently planning to document data using the XML standard of the Data Documentation Initiative (DDI) are a few of the data archives. Eventually, however, the survey organizations that produce data will have to be the ones to do it. How will that happen? In the short term, we need to provide helpful tools, so that it will become feasible for data producers to do "the right thing." One tool that I will describe is a set of procedures for documenting computer assisted survey instruments. Those procedures can be extended to generate DDI for computer assisted interviews and even computer assisted data entry. In the longer term, we may reach the point at which funding agencies, especially the National Science Foundation, require that data files produced through their funding be documented in accordance with DDI standards.
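The sketch below illustrates the general idea of deriving DDI-style variable markup from the item definitions in a computer-assisted survey instrument. The instrument structure is an assumption, and the element set is simplified; it is not the author's actual tool, and real DDI codebooks use the full DTD element set.

```python
# Hypothetical generator: turn instrument item definitions into simplified
# DDI-style <var> elements (question text plus category codes and labels).
from xml.sax.saxutils import escape

INSTRUMENT_ITEMS = [
    {"name": "Q1", "text": "What is your age?", "type": "numeric"},
    {"name": "Q2", "text": "Do you rent or own your home?", "type": "categorical",
     "categories": {1: "Rent", 2: "Own"}},
]

def item_to_ddi(item):
    """Emit a simplified <var> element for one instrument item."""
    lines = [f'<var name="{item["name"]}">',
             f'  <qstn>{escape(item["text"])}</qstn>']
    for code, label in item.get("categories", {}).items():
        lines.append(f'  <catgry><catValu>{code}</catValu><labl>{escape(label)}</labl></catgry>')
    lines.append("</var>")
    return "\n".join(lines)

for item in INSTRUMENT_ITEMS:
    print(item_to_ddi(item))
```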
How to Use the DDI DTD in Day to Day Archive Practice
Marion Wittenberg (NIWI)
NIWI has made the choice to produce and store the metadata in a relational database and to use XML only as an exchange format. We think that in a database environment it is easier to control the documentation process. For publishing on the Web and in NESSTAR we export the metadata into XML files. Many European archives seem to struggle with this problem; as far as we know, most archives have not made a final choice. In our data model, we have made some enhancements to the DDI DTD, for instance changes to document a collection of historical data. Our data model has been extended to document the relation between the data and the original source. Another enhancement provides for including information about a series of data files. We are also evaluating another documentation system called ILSES. ILSES has the ability to link publications to studies and even to variables. In my presentation I will discuss the problems we have encountered and the solutions we have developed to convert the ILSES metadata into DDI XML files.
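A minimal sketch of the export step this workflow implies: study descriptions kept in a relational database are written out as XML only at exchange time. The schema, sample row, and element names (including the local extension for the original historical source) are invented for illustration and simplified relative to the real DDI DTD and NIWI's data model.

```python
# Hypothetical relational-to-XML export: metadata lives in database tables and
# is serialized to simplified DDI-style XML purely as an exchange format.
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE study (id TEXT, title TEXT, source TEXT)")
db.execute("INSERT INTO study VALUES ('P1001', 'Dutch census of 1899', 'Printed census volumes')")

def export_study(study_id):
    """Build an XML fragment for one study from its database row."""
    sid, title, source = db.execute(
        "SELECT id, title, source FROM study WHERE id = ?", (study_id,)).fetchone()
    study = ET.Element("stdyDscr", IDNo=sid)
    ET.SubElement(study, "titl").text = title
    # local extension: record the original (historical) source of the data
    ET.SubElement(study, "dataSrc").text = source
    return ET.tostring(study, encoding="unicode")

print(export_study("P1001"))
```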
2000-06-08: F2: Panel
The Role of Data Quality in Social Science Research
Paul Bernard (University of Montreal)
Josefina J. Card (Sociometrics, Inc.)
Michael Carley (Sociometrics, Inc.)
Cathryn Dippo (U.S. Bureau of Labor Statistics)
Decisions based on data can be affected by the quality of the data that underlie the analysis that leads to the decisions. Most users look for data of the highest quality that they can afford for the purpose at hand. But what do we mean by data quality? Are there absolute measures? Who 'grades' the quality? What are the attributes of quality data? How is this information portrayed to users? What is the role of metadata? This session will be in the form of a panel discussion and will attempt to shed light on this important issue through a series of presentations by a data producer, a data professional and a researcher. Each will address this issue from their perspective. Cathryn Dippo from the Bureau of Labor Statistics will present the elements of data quality from a data producer's point of view, using the Current Population Survey as an example. Data producers normally try to address the quality of data from the point of view of relevance, accuracy, timeliness, accessibility, interpretability and coherence. Ms. Dippo will explain how the BLS attempts to meet these objectives. Josefina J. Card and Michael Carley will present a paper in which they will discuss three areas of major concern for data librarians and data users: data quality, format, and dissemination. They will explore how one large data collection, the Social Science Electronic Data Library (SSEDL), compiled over the last 17 years by Sociometrics Corporation, has addressed each of these issues and the conflicts and problems that arose during that process. Dr. Paul Bernard, Professor of Sociology, Université de Montréal, will speak from the perspective of a researcher who has used various public use microdata files from Canadian and US agencies. He contends that the quality of data is often discovered through their use and that there is a need for a feedback loop between researchers and producers to properly capture this type of information. He will also discuss the challenges of comparisons over time when standards and measurements change, the need for interaction between qualitative and quantitative research, and the issue of describing quality in measurements of difficult phenomena such as relationships.
2000-06-08: G1: Whither IASSIST? or IASSIST in the 21st Century
Descriptive Figures and Hints for Further Research: IASSIST as a Virtual Community
Karsten Boye Rasmussen (University of Southern Denmark)
Repke de Vries (Netherlands Institute for Scientific Information Services)
As a group of early adopters of technology, IASSIST has long been an organization of international professionals communicating through electronic mail and meeting face-to-face only at a yearly conference. More recently, IASSIST has become web-visible, and the informational web brochure has been further improved by giving access to the articles of the IASSIST Quarterly (IQ) and an FTP exchange facility. This presentation is introspective, as the object of investigation is the IASSIST organization itself. The success of establishing IASSIST as a "virtual community" will be estimated from a descriptive viewpoint. From ordinary and available statistics on the use of e-mail and the web-site, the presentation will conduct a preliminary investigation into research questions as well as into practical issues of improving the electronic facilities and capabilities within IASSIST. The presentation will look at the electronic possibilities of spreading information to a global audience, the contacting of authors by the readership of IQ articles, the effects of announcements on different email lists, the use of the IASSIST web-site by non-members, and the impact of search engines. Several methodological questions will be addressed. First of all, issues of using incomplete and difficult-to-link existing data collections on web-site and email-list usage demand discussion and methodological stratagems. Secondly, the presentation is a step on a research path of "virtuality in organizations". In the IASSIST case it is intended to carry out several intensive and new data collections, both by contacting the IASSIST membership and by having members and non-members identified while using the web-site. This will also raise a discussion - from the dual viewpoint of data collector and members of an organization - on the privacy and surveillance aspects of gathering more complete data on the individual by use of detailed logs from e-mail listservers and web-sites. Are researchers into virtual communities threatening the privacy of memberships, or are the legal directives of privacy threatening the research?
The IASSIST Five year Plan, Ten Years After
Chuck Humphrey (University of Alberta)
Ann Green (Yale University)
2000-06-09: H0: Culture Shock and Identity Crisis: Data in Libraryland
Culture Shock and Identity Crisis: Data in Libraryland
Deborah Dancik (University of Alberta)
Tom Parris (Harvard University)
Jean Sykes (London School of Economics)
The convergence of traditionally separate spheres of data providers -- libraries, data centers, computer centers, business libraries, departmental computer laboratories -- is a phenomenon that most of us are experiencing now. Whether this convergence takes place through mergers, take-overs, cooperative agreements, or simply increased communication, we are challenged with understanding our different attitudes, practices, histories and finding common ground that will allow us to make the best use of our increasingly connected environment. A panel of speakers from social science libraries and data service backgrounds will articulate a variety of perspectives on services, staff recruitment and training, collections, and system development.
2000-06-09: I1
Digital Library Federation: Reorganizing, Refocusing, and Moving Towards Concrete Collaboration
Daniel Greenstein (Digital Library Federation)
The Digital Library Federation (DLF) (http://www.clir.org/diglib/dlfhomepage.htm) is currently in the process of reorganizing and refocusing. The DLF was founded in 1995 to establish the conditions for creating, maintaining, expanding, and preserving a distributed collection of digital materials accessible to scholars, students, and a wider public. The Federation is a leadership organization operating under the umbrella of the Council on Library and Information Resources. It is composed of participants who manage and operate digital libraries. This session will provide an overview of the DLF's initiatives and plans for the future. There will be time for focused discussion about the issues that concern participants in relation to data and other initiatives, and about how the organization and its partners can move beyond discussion to concrete collaboration.
2000-06-09: I2: Integrated Geospatial Information Systems and Services
Spatial Data Integration Challenges: The Gridded Population of the World Approach
W. Christopher Lenhardt (CIESIN)
EDINA Digimap: an internet mapping and data service for the UK Higher Education Community
David Medyckyj-Scott, et al (University of Edinburgh)
EDINA Digimap is a Web-based service for UK Higher Education Institutions (HEIs), providing online access to Ordnance Survey (GB) digital map datasets. The paper will cover the following issues that the creation of the service has raised: * Who the stakeholders in the service are, and what their involvement is. * What map datasets are made available through Digimap. These range from a very detailed dataset, showing individual buildings and gardens, to contours (at 10m vertical interval) and "road-atlas" style mapping. All products have national coverage, although for the large-scale data a rationing mechanism, restricting access to 30% of the data in any one year, has had to be devised. * Who is allowed to use the service, and for what. * What copyright restrictions have arisen in providing the service, and how these play out in practice. Specific issues that have arisen include use in teaching and research, electronic versus paper publication, place of use, and data rationing. The functions of EDINA Digimap can be divided into two types, categorised by the technical requirements they put on the user's computer: server side, for "universal" access with minimal restrictions on hardware and browser software and a minimal learning curve; and client side, powered by Java, for more sophisticated users. We expect EDINA Digimap to extend as a service in two general directions: greater breadth of datasets and additional functionality. The former could include heterogeneous data, such as aerial photography, geodemographic data and historic maps; the latter, data integration, on-screen addition of data, real-time data analysis and simplification. At the end of the paper the lessons learnt from the creation of Digimap will be summarised.
Discover, Visualize, and Access Geomatics Data and Services: Current Concepts and Technology
Cameron Wilson (Natural Resources Canada)
The library and archival communities commonly use and develop cataloguing schemas to store and retrieve analog and digital publications. A similar system is used in map libraries. The current challenge is how to optimize the retrieval, evaluation and access of geomatics data. Geomatics or geographical data commonly require search parameters in addition to the standard subject, author and date; additional geospatial parameters include geographic extent and time range. Similarly, special software and knowledge are required to visually assess and access these data sets. Natural Resources Canada, a federal department of the Government of Canada, produces a large volume of digital geographic data in conjunction with other federal departments, other levels of government and international sources. The geomatics concepts of Discover, Visualize and Access are presented in both an operational setting and a research and development context. International or adopted standards and associated current technology are discussed within this framework.
2000-06-09: I3: Dynamics of Digital Objects and Accompanying Metadata
Representing Metadata with Intelligent Agents: An Initial Prototype
Edward Brent (Idea Works, Inc.)
Albert Anderson (Public Data Queries)
G. Alan Thompson (Idea Works, Inc.)
This paper moves away from old metaphors for social science data description to the metaphor of an active agent capable of taking the initiative to assist the user in selecting appropriate data sets and variables as well as framing problems so that they can be answered with the data. Why can't various elements of metadata be active agents capable of telling both users and computer programs that use the data important characteristics of the data, tailoring information to each user's needs, and learning from each user to evolve over time? This paper describes an integrated approach in which intelligent agents permit the user to issue broad queries delegating the details to the agent; case-based reasoning guides the user to relevant examples; machine learning permits successful queries to be added to the program's expanding knowledge base for help with future queries; and expert systems provide advice to the user on a range of issues. Two prototype modules implementing some of these capabilities are described and the utility of this approach is illustrated for the PDQ-Explore system for providing rapid intelligent access to the U.S. Bureau of the Census PUMS, IPUMS and Supersample data sets.
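The case-based component described above can be pictured with a small sketch. The following Python fragment is only an illustration of the general technique of case-based reasoning over past queries, not the PDQ-Explore implementation; the case structure, the word-overlap similarity measure and the `QueryCaseBase` class are assumptions introduced for the example.

```python
# Minimal sketch of case-based retrieval over successful queries.
# A "case" pairs a free-text question with the dataset/variables that answered it;
# new successful queries are added so the knowledge base grows with use.

def tokens(text):
    """Crude keyword extraction: lower-cased words of three or more letters."""
    return {w for w in text.lower().split() if len(w) >= 3}

class QueryCaseBase:
    def __init__(self):
        self.cases = []  # list of (question, {"dataset": ..., "variables": [...]})

    def learn(self, question, solution):
        """Store a successful query so it can guide future users."""
        self.cases.append((question, solution))

    def suggest(self, question, k=3):
        """Return the k most similar past cases by word overlap (Jaccard)."""
        q = tokens(question)
        scored = []
        for past_question, solution in self.cases:
            p = tokens(past_question)
            overlap = len(q & p) / len(q | p) if (q | p) else 0.0
            scored.append((overlap, past_question, solution))
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:k]

# Example use: seed with one answered query, then ask a related question.
kb = QueryCaseBase()
kb.learn("median household income by state, 1990 census",
         {"dataset": "PUMS 1990", "variables": ["HHINCOME", "STATEFIP"]})
print(kb.suggest("household income for California in 1990"))
```

The point of the sketch is only the shape of the loop: broad user queries are matched against an expanding base of worked examples, and each successful query enriches the base for the next user.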
Hyperlinking the World of Social Science: Integrating Text and Data in a Global Hypertext Space
Jostein Ryssevik (Norwegian Social Science Data Services)
The idea of connecting intellectual resources by means of (hyper)links was first conceived by Vannevar Bush in his article "As We May Think" in pre-digital 1945. Further down the lane the concept was brought to the Net - first as a vision in Ted Nelson's Xanadu Project (1960-), later as a generally accepted navigation tool in Tim Berners-Lee's first protocol for the Web (1991). Today, jumping from one Web-page to another by means of mouse-clicks has become so trivial that we no longer reflect on the power of the concept, let alone feel thrilled by the fact that the linked objects might be stored on different continents. In the first generation of the Web, hyperlinks were merely used to create shortcuts between fragments of text. Today all kinds of digital objects, like pictures, sounds, animations and software components, can be linked. For empirical social science there are at least two classes of digital objects that are gradually making their way onto the Web. One is the scientific text - the conference papers, the journal articles and in a few cases even the books. The other is the empirical data that the scientific texts are based upon. The paper will discuss and demonstrate (with live examples based on the NESSTAR platform) how this powerful technique can be used to integrate on-line texts with live data (not only dead tables or graphs, but live data objects that can be analyzed and manipulated by the users). It will also discuss how these new techniques (which in many ways are blurring the division between the roles of researcher/publisher and reader) might be used to facilitate cumulative research or to create digital interactive teaching materials or knowledge gardens.
User Driven Integrated Statistical Solutions - Digital Government by the People for the People
Mark E. Wallace (U.S. Census Bureau)
The U.S. Census Bureau and other government agencies are exploring how best to respond to increasing user requests for concurrent access to multiple data sets and integrated data. Fulfilling this vision takes advantage of a historic opportunity for the federal statistical community and the citizens and taxpayers of our Nation as we enter the 21st Century. In collaboration with other government agencies, the implementation of this vision will produce a modernized, customer-driven, cross-program, and cross-agency integrated data access and dissemination service capability available via statistical portals such as FedStats. Census and other agencies will broaden information delivery, reduce data user burden, increase efficiency, and reduce redundancies by providing standards, processes and tools in the administration of information integration architectures; metadata repositories; product conception, design, and development; and new disclosure techniques. This paper describes these various efforts. As they begin to succeed, they will help build critical capabilities in the Nation's emerging statistical and spatial data infrastructures that will support global, national, regional, local, and individual decision support systems.
2000-06-09: J1: Archives and Assignments: Instructional Uses of Data Services
Archives as Teaching Tools: Using the American Religion Data Archive for Undergraduate Instruction
Roger Finke (Pennsylvania State University)
This session will give an introduction to the American Religion Data Archive (ARDA) and offer examples of how the site is being used by students and their instructors. The ARDA is an Internet-based data archive that stores and distributes quantitative data sets from the leading studies on American religion. Supported by the Lilly Endowment, ARDA strives to preserve data files for future use, prepare the data files for immediate public use and make the data files easily accessible to all. Because of the ease in using the site, ARDA has been used extensively by undergraduate students. The online auto-analyzer and complete documentation allow students to explore the data files online, and the courtesy software and download features allow students and instructors to download the data files to their own machines and networks.
Moving from the Blackboard to the Computer: Using Data in a Classroom Setting
Lisa Neidert (University of Michigan)
Instructional materials increasingly include the use of computers, data, data manipulation, and statistical techniques, even when the course is not a core statistics or research methods class. This reflects the growing realization that information literacy, and more specifically statistical literacy, should be a core competency for all academic disciplines. This paper will describe a course that introduces undergraduate students to census data and rudimentary data manipulation in a classroom setting. It fulfills a quantitative reasoning requirement at the University of Michigan. The paper will include a description of the course interface developed for classroom teaching, an example of a typical classroom exercise including the substantive background reading materials, and a demonstration of the interactive use of data. The paper will close with a description of how the classroom materials can be incorporated into stand-alone modules that illustrate points about data, a specific data set, a statistical concept, or a policy issue. At the very simplest, these modules are stable, but the ideal module is one that provides canned text and examples and then allows the user to modify and move beyond the canned example, i.e. a living book. The creation of these modules involves integrating substantive knowledge, data support, and computing expertise.
Promoting Use of Numeric Datasets in Learning and Teaching Through Enhanced Local Support
Robin Rice (Edinburgh University)
Melanie Wright (University of Essex)
This paper will report on a project in progress: "Using Numeric Datasets in Learning and Teaching," funded by JISC, the UK Joint Information Systems Committee for Higher Education. Making effective use of numeric data requires a greater range and depth of skill, more preparation, and more time than printed materials or bibliographic databases; and students and teachers require more support than is generally provided. While some problems can be solved by the data providers themselves, and others by a nationally co-ordinated approach to support for learning and teaching, others can only be overcome at the local level. By examining patterns at the institutional level, this proposal seeks to generate knowledge about good practice, and pitfalls faced by those charged with supporting teachers and learners wishing to make use of national data resources. The objectives of the project partners from two university data libraries and three national data centres are: - To establish a short-life Task Force on Using Data in Teaching and Learning, made up of local data service representatives and users of data services - To investigate the existing use of numeric data in learning and teaching through a national sample survey and case studies - To investigate the existing forms of local data service support - To carry out national studies to achieve these objectives and report back to JISC.
2000-06-09: J2: Qualitative Data: Collecting, Preserving and Sharing
Progress and Problems of Preserving and Providing Access to Qualitative Data for Social Research: the international picture of an emerging culture
Louise Corti (University of Essex)
In this paper, I will offer a global picture of what is happening in the world of qualitative data archiving. Qualidata is in a strong position to offer this insight, as it was the world's first initiative to pioneer the preservation of qualitative social science data on a national scale. This was facilitated by the Economic and Social Research Council (ESRC), Britain's largest sponsor of social science research, deciding to implement a mandatory policy for award holders to offer datasets of all kinds created in the course of their research. Over the past four years we have been approached for advice by many small, embryonic 'qualidata' projects on issues surrounding archiving and providing access to qualitative data. For some years now there has been a strong international culture of archiving and re-using oral history data. This community has, however, few, if any, overlaps with the international social science data archive community. Yet both have the same mission: to preserve and provide access to social science data. Qualidata has managed to provide a bridge across the two communities in the UK, and its main role is to co-ordinate information about the existence of all available sources of qualitative data in Britain, wherever they are housed. At the same time we have now persuaded a great many players in between, i.e. those who use other contemporary qualitative approaches such as ethnography and anthropology, to emerge from their cosy cocoons and begin to consider sharing and re-using qualitative data. Qualidata's work has provided sparks of inspiration to a number of research groups across the world who were previously interested in the idea of sharing data but weren't sure how to go about it. Many have used Qualidata as a model for developing their archiving procedures (which, we should stress, were initially devised from a cross-fertilisation of UK Data Archive procedures and traditional archival repository procedures in Britain). Typically we have found these groups to be sociologists, almost none of whom had any contact with or knowledge of the Social Science Data Archives in their own countries. We are still not aware of other major funders across the world who have realised the added value that archiving qualitative data can bring. I will provide a world tour of who is doing what and where in this field. I will briefly address the range of objectives and strategies employed by these projects and then discuss optimal models of qualitative data archiving. I will also outline the formal network of qualitative data archives currently being formed, and some of the aims of this network.
Data Archives: Documentation of and access to sensitive data: the International Committee of the Red Cross project
Reto Hadorn (Swiss Information Service and Data Archive for the Social Sciences)
To mark the occasion of the 50th anniversary of the Geneva Conventions, the ICRC organised a worldwide consultation with people who have experienced war in the past several decades, in order to find ways to protect them better in times of armed conflict. The consultation was conducted with civilian populations and with combatants in 12 countries that have endured the modern forms of war. In the war settings, the consultation included national opinion surveys, as well as in-depth focus group discussions and face-to-face interviews. In all, the ICRC project interviewed 12,860 people in war-torn countries and conducted 105 focus groups and 324 in-depth interviews in the following countries: Israel and the Occupied Territories, Afghanistan, Bosnia-Herzegovina, Cambodia, Colombia, El Salvador, Georgia and Abkhazia, Lebanon, Nigeria, the Philippines, Somalia and South Africa. Broadly sketched, the survey covers the following topics: opinions and judgements about various acts and behaviours in war; the endangerment of civilian populations, specifically women and children; the treatment of prisoners; the necessity and possibility of better protection of non-combatants; and the role of international organisations in giving that protection. In addition, the consultation included national opinion surveys in four of the five permanent member countries of the UN Security Council - France, the Russian Federation, the United Kingdom and the United States - to see how the publics in these superpower countries view war. Finally, it included a survey of the public in Switzerland, the depository state of the Geneva Conventions. The ICRC decided to deposit the data with the SIDOS data archive. The paper will present some of the main challenges data depositor and archive are confronted with: - The question of possibly restricted access to the data is central, given the sensitivity of the topics treated. The various solutions discussed with the data depositor will be reviewed. - An anonymization policy and technique had to be developed in order to protect interviewed persons and persons named in the interviews. The interesting question here is: what are the limits of anonymization if you want to keep the meaning of what was said? - Given the variety of societies and cultures concerned, careful documentation of the social, political and cultural contexts is a condition of valid interpretation of the transcripts. Which documents can effectively be collected and integrated into the documentation? - The complexity of this international research project explains the difficulties met in documenting some very concrete aspects of the research process. The ICRC is doing a great job in its attempt to compensate for the fact that the archiving takes place after the end of the project.
Making Qualitative Research Reusable: Case in Finland
Arja Kuula (Finnish Social Science Data Archive)
Social sciences in Finland have a long and continuous tradition of qualitative research. A great amount of significant qualitative material has been collected during the last few decades. Until now there has been no systematic way to preserve, index or catalogue qualitative social science research material in Finland. The Finnish Social Science Data Archive (FSD) was founded in 1999. Even in the planning stage, attention was also paid to qualitative social science research material. This kind of material is not archived by FSD, but one of our goals is to improve access to it and facilitate the re-use of qualitative research material. Our paper will present FSD's strategies regarding this goal. The strategies comprise the following elements: - developing and maintaining an information database on Finnish qualitative research material that is reusable for academic research and teaching purposes, - developing documentation standards, - cooperation with Finnish qualitative data producers and collectors such as academic research projects, the Finnish Literature Society etc., and - international cooperation with organisations that share similar objectives.
Informatics-based Support for Research and Education in the Field of Contemporary Studies
Zoltan Lux (1956 Institute)
I. Developments in informatics at the 1956 Institute. The 1956 Institute deals with research into Hungarian history since the Second World War, with emphasis on the 1956 Hungarian Revolution and its development, subsequent effects and international aspects. Since its foundation in 1990, the Institute has been amassing databases containing the documents gathered and used in its research, or descriptions of these. This has resulted, for instance, in a bibliographic database of books, articles and films about the 1956 Revolution. The 1956 Institute also holds an oral history archive of more than a thousand life-span interviews. The subjects of about 500 of these took part in the 1956 Revolution on one side or the other. These interviews (the transcripts of which cover 1000-2000 typewritten pages in some cases) have been used to compile a database of abstracts, which allows detailed searches to be made for persons, events, institutions and places mentioned in the interviews. Furthermore, it is possible to search from these details back to the interviews in which they are mentioned. More than 20,000 trials were held in Hungary during the reprisals that followed the 1956 Revolution. Exploring these trial documents is not simply important to an understanding of the 1956 Revolution and the subsequent reprisals; the database of those appearing in the trials is an essential resource for researchers into contemporary history and sociology in general. The 1956 Institute has been continually engaged for several years in compiling a database of these trials and those who appeared in them, in conjunction with several partner institutions, mainly libraries. The databases were built on a Hungarian-developed, networked, free-text database-handling programme called TEXTAR. However, this was not capable of storing the full text of lengthier documents or multimedia documents, for example. In 1996, work began on transferring the databases to an ORACLE-based handling system. The new platform, apart from receiving the old databases, allowed new functions to be incorporated that the old system would not have supported (e.g. storage of complete texts and multimedia documents). The range of document types for database inclusion was expanded (to include audio-visual documents). The several separate databases were merged into a single database and the parallel records (of persons, institutions, etc.) were combined. The 1956 Institute does not aspire to become a vast virtual data archive. As a research centre dealing with contemporary Hungarian history and its international context, it is primarily a data producer, gathering data during its research. The primary purpose behind this data gathering is to provide information of use to the Institute in its researchers' work and its publishing activity (books and web pages alike), and in instruction and public education. This need not happen in a direct way. The Institute is also seeking opportunities to cooperate with larger virtual data archives. II. After briefly introducing the 1956 Institute, the lecture turns to the following problems in the process of data compilation and provision: - Researchers and databases. The problem of sharing individually held information. Disposal of and control over information. Do researchers have a personal stake in sharing (or not sharing) information? - The scope for automatic gathering of information important to researchers, and current practice in this respect. Identification of authentic sources of data.
- Compiling digital publications, which is much more difficult without databases. - Multimedia databases and their foundations. - The costs of operating and expanding databases, and how the Institute can finance the costs of keeping pace with the revolutionary development in informatics. - Data-structure, communication and software standards (or the absence of these). III. There will be a presentation of a public section of the 1956 Institute database (chronology, photo documentation, Romanian historical documents, trial data archive) utilized by researchers and accessible on the web at www.rev.hu. Also presented will be an internet and CD-ROM publication based on the databases.
2000-06-09: J3: Economics and Business Resources in your Data Library
Technical Issues in Providing Economic and Financial Data
Paul Bern (Princeton University)
Deciding on what type of economic data you need and finding it are only the first steps. What are you going to do with it once you have it? In this presentation I will explore the various options and make some recommendations on how to provide and analyze economic data. First, we must decide how we are going to make the data available to our users. Should we just hand them the file and say, "Have fun"? Many of these files are very large and some are in formats not easily read by all statistical packages, so you may need to process the file yourself before making it available to your users. Second, your users will need to analyze these data. What kinds of statistical procedures should you use? Most, if not all, economic analyses require some time component - looking at wages over the last ten years, for example. Panel studies - surveying the same people several times - are common and require special handling. Whatever statistical package you use, you must make sure that it has the proper techniques and options available to do what you need. Finally, what software should you use? Which is better: SAS, Stata, or SPSS? SAS is, by far, the best at file management, and has procedures for reading CRSP, Compustat, OECD, IMF and many other data formats. Stata, on the other hand, is much better with time-series data and is much easier to learn than SAS or SPSS. SPSS has the best user interface and good graphics. So maybe you'll use more than one.
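As an illustration of the pre-processing step mentioned above, the sketch below converts a large fixed-width raw data file into formats that common statistical packages can open directly. It is a minimal example and not part of the paper: the file name, column positions and variable names are invented, and Python with pandas is used here simply as one convenient tool for the job.

```python
# Minimal sketch: turn a fixed-width raw data file into package-friendly formats.
# The file name, column positions and variable names below are hypothetical.
import pandas as pd

colspecs = [(0, 4), (4, 6), (6, 14)]          # character positions of each field
names = ["year", "state", "wage"]             # variable names for those fields

df = pd.read_fwf("raw_extract.dat", colspecs=colspecs, names=names)

# Basic checks before handing the file to users: row count and simple
# descriptives help confirm the data were read correctly.
print(len(df))
print(df.describe())

# Write out copies that SPSS, Stata and other packages can read without custom code.
df.to_csv("extract.csv", index=False)          # universally readable
df.to_stata("extract.dta", write_index=False)  # native Stata dataset
```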
Meeting the Needs of Economic and Financial Data Users
Heather McMullen (Harvard University)
Where can I find the monthly stock price index for France back to 1856? The Gini coefficient for 20 countries? Economists as well as other researchers from a variety of disciplines have an insatiable demand for economic and financial data, preferably as a time series in electronic format. Faculty members expect access to expensive financial data retrieval systems at their desktops. Economics undergraduates must obtain "real-life" data for analysis in a class project or thesis, yet the data retrieval process can consume a significant amount of time before the analysis has even begun. How are Data Centers and Libraries to cope? This talk will explore new user expectations for economic and financial data, and how these expectations differ from those of traditional users of social science data centers. Many issues arise in building support services for economic and financial data, including collection development, license negotiations, staff training and IT challenges. The service model at Harvard University and the relationships between various library units and the Harvard-MIT Data Center will be discussed. There will be an overview of several standard providers of economic and financial data, including commercial vendors, inter-governmental organizations and U.S. government data. Special attention will be given to strategies for developing expertise in economic and financial data reference.
Finding Global Economic Statistics is Easy. Isn't it?
Sean Townsend (London School of Economics)
Economics, and particularly business, are amongst the most popular areas of study for university students. Certainly outside of the specialist sciences, they remain the most marketable of subjects for prospective employees. This growth pattern has moved in parallel, perhaps by coincidence, with a miniature explosion in the resources available to economists across the globe. This knowledge boom enables researchers to find out just about any information concerning a national economy, and in some cases the data span several decades and are disaggregated across various themes. The biggest challenge at the moment is, very simply, how to find the right information within this ever increasing and diverse landscape. The old days of consulting tried and tested official sources are gone. Today economists can browse websites and CD-ROMs that boast data series in the hundreds of thousands. Drilling down and mining these resources is perhaps the most valuable research skill any social scientist can have today. As expectations for disaggregation continue to rise, the need for roadmaps and guiding lights becomes ever more pressing. This paper aims to summarise the nature of the problem, citing some examples, and also to provide a brief outline of the key resources that economists use today. Emphasis will be placed on European sources, but reference will also be made to global organisations such as the OECD, and content providers such as Primark.
2000-06-10: Workshops
Creating a Data Service
Jocelyn Tipton (Yale University)
Christof Galli (Duke University)
This Full Day workshop will introduce participants to a variety of topics in the design and implementation of a new data services department. We will examine how to manage and organize data services, identify and select data and documentation, and review issues regarding access and use of the data. The workshop is introductory in nature and is designed for new data librarians and data service providers.
Preparing Data for Your User Community
Bo Wandschneider (University of Guelph)
Gregory Haley (Columbia University)
This two-part workshop will introduce participants to various tools that can be used to read, massage and analyze data that come in different formats. It will focus primarily on the use of raw ASCII files. Participants will work through examples using SPSS, SAS, STATA, and PERL, followed by a series of hands-on exercises to ensure that the material being discussed is understood. Work with the statistical packages will be done in a Windows environment, while the PERL session will be done under UNIX. The morning session will cover the following topics: SPSS - participants will be shown how to read data into SPSS from non-SPSS file formats. Topics covered will include reading raw ASCII files, import, translate and, time permitting, some descriptive statistics to confirm that the data have been properly read. Data from a survey such as the Eurobarometer will be used in this exercise. STATA - this section of the workshop will be a brief introduction to the fundamentals of reading data and getting it into a STATA data set. The afternoon session will cover: SAS and the PSID - in this section participants will work with files from the PSID. Specifically, participants will be shown how to read from both the family and individual files and how to merge these records together; a sketch of this linkage idea appears below. One of the issues with the PSID is the continuity of variables over time. An exercise will be given in which the user will have to find a set of variables from two different years of the family files and use the individual files to link these records. Participants will rely on programs, codebooks and information from the PSID web site. PERL - this section will be a brief introduction to the use of PERL to read and manipulate large data files. In some instances this is easier and quicker than using the packages outlined in the first part of the course. Time permitting, other PERL utilities will be demonstrated. Enrollees in the afternoon session should have an understanding of the material covered in the morning session.
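The family/individual merge step referenced above can be sketched in a few lines. The example below is not the workshop's SAS or PERL code; it is a hedged Python illustration of the same idea, linking individual records to their family record through a shared identifier. The file names, column positions and the `family_id` key are hypothetical.

```python
# Sketch of the family/individual linkage idea: read the family file into a
# dictionary keyed on a family identifier, then attach family fields to each
# individual record. Fixed-width positions and file names are hypothetical.

def read_family_file(path):
    families = {}
    with open(path) as fh:
        for line in fh:
            family_id = line[0:5].strip()
            income = line[5:12].strip()
            families[family_id] = {"family_income": income}
    return families

def merge_individuals(ind_path, families):
    merged = []
    with open(ind_path) as fh:
        for line in fh:
            family_id = line[0:5].strip()
            person = {
                "family_id": family_id,
                "age": line[5:8].strip(),
                # Attach the family-level fields, if the family record exists.
                **families.get(family_id, {}),
            }
            merged.append(person)
    return merged

families = read_family_file("fam1993.txt")
people = merge_individuals("ind1993.txt", families)
print(people[:3])
```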
The Data Documentation Initiative: Creating XML Documents and XSL Style Sheets
Ann Green (Yale University)
Peter Joftis (ICPSR)
Bill Block (University of Minnesota)
Peter Granda (ICPSR)
Patrick Yott (University of Virginia)
The Data Documentation Initiative (DDI) is rapidly becoming an international standard for the content, presentation, transport, and preservation of "metadata," the information users need to select, evaluate, manipulate, and understand statistical data in the social and behavioral sciences. Information contained in traditional codebooks can now be created in a uniform, highly structured format that is easily and precisely searchable, that lends itself well to simultaneous use of multiple data sets, and that will significantly improve the content and usability of social science metadata. The DDI is also playing a significant role in the design and development of Web-based data dissemination and analysis systems. This workshop will provide an understanding of the DTD, an introduction to how DDI documents are created and used with style sheets, and a review of how the DDI is used in the dynamic NESSTAR system of data discovery, usage, and dissemination. This is part one of a two-part session; see DDI2: NESSTAR for part two. Part 1. Scope: Overview of the structure of the Data Documentation Initiative XML DTD, its current status and applicability. Instructors: Ann Green (Yale University) and Peter Joftis (ICPSR). Time: 30 mins. Part 2. Scope: Introduction to authoring DDI documents. Participants will create sample codebooks using the UMn shareware authoring tool. The workshop will include a review of a basic codebook and how the material maps into the DDI DTD structure. Hands-on coding activity will be focused in two sections: 1) the study-level information and 2) the file and data description portions of the codebook. Copies of the authoring tool and a fully tagged codebook will be provided to all participants. Instructors: Wendy Treadwell and Bill Block (UMN), and Peter Granda (ICPSR). Time: 90 mins. Part 3. Scope: Overview of the purpose and use of style sheets using XSL and DDI codebooks. Participants will be able to review and edit sample style sheets. Instructor: Patrick Yott (University of Virginia). Time: 60 mins.
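To give a flavour of what a DDI-structured codebook looks like, the sketch below builds a tiny XML fragment in Python. It is a hedged illustration only: the element names (codeBook, stdyDscr, dataDscr, var, labl, qstn) follow the general shape of the DDI DTD but are simplified here, and the study title and variable are invented for the example; a real document would be authored with the workshop's tools and validated against the DTD.

```python
# Minimal sketch of a DDI-style codebook fragment, built with the standard library.
# Element names are simplified approximations of the DDI DTD; the study and
# variable shown are invented for illustration.
import xml.etree.ElementTree as ET

codebook = ET.Element("codeBook")

# Study-level description (title, responsible party).
stdy = ET.SubElement(codebook, "stdyDscr")
ET.SubElement(stdy, "titl").text = "Example Attitude Survey, 1999"
ET.SubElement(stdy, "AuthEnty").text = "Example Research Group"

# Data description: one variable with a label and its question text.
data = ET.SubElement(codebook, "dataDscr")
var = ET.SubElement(data, "var", name="TRUSTGOV")
ET.SubElement(var, "labl").text = "Trust in national government"
ET.SubElement(var, "qstn").text = "How much do you trust the national government?"

# Serialise; a real workflow would validate against the DDI DTD and apply an
# XSL style sheet to render a human-readable codebook.
print(ET.tostring(codebook, encoding="unicode"))
```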
Publishing DDI-documented data through NESSTAR
Jostein Ryssevik (Norwegian Social Science Data Services)
Lene Wule (Danish Data Archive)
Ken Miller (University of Essex Data Archive)
Melanie Wright (University of Essex Data Archive)
The session will give an introduction to the NESSTAR system and how to run a NESSTAR server. A brief overview of the end-user tool (the Explorer) will be followed by a more thorough introduction to the tools that have been developed to make it easy to create and publish DDI-documented data, either on a NESSTAR server or through other systems. These tools include various stand-alone converters from existing formats to the DDI, a fully DDI-supporting statistical package with an integrated DDI editor (NSDstat), and the NESSTAR Publisher, which will allow archives/researchers to develop their data/metadata and publish them over the Web to a (remote) NESSTAR server. The workshop will be a mixture of introductions and hands-on sessions and will give the participants concrete experience of how to set up and populate a Web-based data library based on the DDI standard. Participants registering for this workshop are encouraged to register for Introduction to DDI.
Introduction to Geographic Information Systems (GIS) for Social Sciences
Steve Morris (North Carolina State University)
The workshop will provide a brief introduction to GIS and its application to social sciences. The workshop will consist of two chief components: 1) a basic overview of GIS principles and concepts, and 2) hands-on use of ArcView GIS software. Hands-on exercises will focus on U.S. Census numeric and spatial data, but the possibilities for integrating infrastructure, environmental, and imagery data into social sciences applications will also be investigated. A brief overview of relevant data resources will be provided. At the end of the workshop, participants will be provided with pointers to available resources for continued learning. Previous experience with GIS is not required.
Locating and Documenting US Spatial Data
Michael Furlough (University of Virginia)
For new GIS users, this half-day workshop will focus on the problems of locating, identifying, and acquiring spatial data products in the United States. We will give an overview of the US National Spatial Data Infrastructure and some attention to interpreting and implementing the standard metadata structures promulgated by the Federal Geographic Data Committee.