C. Concepts and Context
1. Full datasets vs quick stats
The word “data” can apply to many things and is sometimes used in the most general sense to mean “anything that needs to be found out.” To be more precise, we can distinguish between statistics and datasets.
Statistics are ways of summarizing knowledge. They can range from simple quick answers to questions such as “what is the percentage of foreign-born residents in my country?” to more complicated, multilayered tabulations (such as crosstabs), which can answer questions such as “what was the difference in black and white personal income for the college-educated population in California in 2010?” Government resources are often the place to start for statistics, but commercial databases and nonprofit organizations also provide this level of information. However, easy-to-access statistics depend on someone else making them available, which may not be the case for novel, highly detailed, or uncommon questions. Statistics for marginalized populations may also be less available due to privacy concerns or small numbers of respondents.
“Dataset” is the term usually used for a complete data file on which research and statistics are based. A dataset may have information on every respondent to a survey or other data collection method. The term “microdata” is also often used in some social science fields to describe data in which each individual’s responses are available. A qualitative dataset may combine numeric and demographic data with each individual’s comments and responses to questions. Datasets do not provide quick answers the way statistics can, but they can be used to answer any question that the data might support, including novel research. Due to their detailed nature, they may also have restrictions on access, as described in the next section.
Data on minority and vulnerable groups is often not directly accessible. Because the data is sensitive, data providers, repositories, and data archives may protect respondents by making it available only in a secure environment. Access may be granted through an application process or through a virtual or on-site data enclave (see, for example, ICPSR on data confidentiality). Before granting access to sensitive data, the repository needs proof that the researcher will handle the data securely, and it can deny access when that proof is insufficient.
ICPSR also requires institutional review board approval to ensure good scientific conduct when granting access to sensitive data. Other data providers may expect a self-commitment to follow “the best possible standards of anti-discriminatory and anti-racist analysis” (see the Ethical commitment to the use of Afrozensus data). In the UK, the UK Statistics Authority oversees the Research Accreditation Panel, which accredits all research projects that use sensitive data made available under the Digital Economy Act. Researchers must demonstrate that their project is in the public good, that it is ethical, and that it will not cause harm or discrimination to minority or vulnerable groups.
3. Search terms and language
Keywords vs controlled vocabulary
Use all relevant terms or keywords and, if available, the database’s controlled vocabulary or indexing and cataloging terms in advanced search. Subject dictionaries, thesauri, and Credo Reference (subscription required) can help identify concepts and words for searching. Articles will often have keywords supplied by authors or the publisher that can be used for searching. A systematic, scoping, or other evidence synthesis review on a relevant subject will also list the search terms and databases used in its methodology section, which is a useful place to harvest terms for your own searching.
Controlled vocabulary, often called a thesaurus, subject headings, or descriptors, is used to index or catalog records under one or more subject or descriptor terms that describe what the content is about, its geographic location, or its type of material or form, even when the records themselves use different words. Data archives and databases may also use controlled vocabulary to index types of study and methodology or to group contents for browsing. Using a subject or descriptor brings up all records indexed with that term. Subject terms also help you identify or discover other keywords for searching. Within a database record in your search results, you may see these subject headings in the subject, genre/form or material type, or location fields. Because the oldest and the most recent works may not have been indexed, and because a new term or emerging concept may not yet have a heading, search using both controlled vocabulary and keywords.
Catalogs, databases, and data archives may use different controlled vocabulary for the same concept, and terms vary by country. In the US, for example, search terms may include refugee and Dreamers; in Germany, guest workers. There may also be international variations in spelling (UK/US) and in which terms are used or avoided. Many European Union and other countries do not collect official or government race data but may collect ethnicity, nationality, language, immigration, or citizenship data, and these variables may be used as a proxy for race.
- Read How Different Countries View Race. There may be non-official surveys that collect this information indirectly (Germany survey examples).
Library of Congress Subject Headings and database subjects are reviewed and changed based on internal reviews, Funnel projects, user feedback and suggestions, and as more knowledge is produced. Many terms related to racial categories and subjects are under revision and are changing quickly. Even after a change, older or inappropriate terms may still exist in the metadata. Even if you have used a controlled vocabulary before, check for updates. Some significant recent changes include:
- PubMed changed homeless to ill-housed persons.
- LCSH added Black Lives Matter and changed Slaves to Enslaved persons.
Empirical studies - Methodology and other useful filters
For many catalogs and databases, the thesaurus is found in advanced search. Read the help or watch a video on how to use the thesaurus, subject headings, search fields, and operators. Use database fields and filters, such as Genre/Form (e.g., data, surveys, polls, data sets, or statistics), Date, Location, and Methodology, to focus and narrow your search.
Boolean, wildcard, proximity, and other search operators can be used with keywords and controlled vocabulary to expand or narrow a search. These include AND, OR, NOT, PRE, ADJ, *, and ?.
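As a rough illustration of how these operators combine, the Python sketch below assembles a Boolean search string: synonyms are joined with OR inside parentheses, and the concept groups are then joined with AND. The helper names quote_phrase, or_group, and build_query are hypothetical and are not part of any database’s interface. Operator syntax, especially truncation symbols and proximity operators such as PRE and ADJ, varies by database, so check the database help before reusing the output.

```python
def quote_phrase(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    term = term.strip()
    if " " in term and not term.startswith('"'):
        return f'"{term}"'
    return term


def or_group(terms):
    """Join synonyms or variant spellings with OR inside parentheses."""
    cleaned = [quote_phrase(t) for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("each concept group needs at least one term")
    if len(cleaned) == 1:
        return cleaned[0]
    return "(" + " OR ".join(cleaned) + ")"


def build_query(*concept_groups):
    """Join concept groups with AND so every concept must be present."""
    return " AND ".join(or_group(group) for group in concept_groups)


# Concept 1: the topic. Concept 2: the kind of material wanted.
# The trailing * is a truncation symbol (assumed to be supported by the database).
query = build_query(
    ["police shootings", "police violence"],
    ["statistic*", "dataset*", "survey*"],
)
print(query)
# ("police shootings" OR "police violence") AND (statistic* OR dataset* OR survey*)
```

Adding more synonyms to a group widens the search, while adding another concept group narrows it.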
Databases may group specific indexing terms under a relevant category, such as “ERIC.ed.gov equity and bias”. Empirical studies can point to relevant data or studies, and authors can be contacted for more information.
Example: APA PsycInfo. Under Methodology, select Empirical Study to limit to empirical research.
For more examples, visit Databases with a Methodology Filter.
1. Library of Congress Subject Headings (LCSH)
LCSH is the controlled vocabulary for indexing and cataloging books, media, and other library collections; it is used by WorldCat, most US libraries, and some international libraries. Data archives and databases may also borrow or build on LCSH. Each subject heading shows synonym (variant) terms and its hierarchical relationships to other headings (broader and narrower terms). There are also geographic subdivision headings (PDF), demographic/ethnic group headings (PDF), and genre/form headings (PDF) to browse for relevant headings.
Search LCSH, or enter terms from the Suggested controlled vocabulary and search terms list. For example, “discrimination” lists specific headings such as Discrimination in housing, Discrimination in higher education, Discrimination in justice administration, and discrimination against African Americans or other groups.
Finding data with the library catalog should be only one of many strategies, because many library records are not indexed by genre/form or by document or material type, so results will not be exhaustive. Government-produced data and statistics and the ICPSR datasets can be searched in a US library catalog. The catalog may also identify databases that include data. Check the catalog search help to identify the appropriate data type field and terms. For example, the LCSH genre/form headings (PDF) include census data, demographic surveys, vital statistics, data sets, geospatial data, and statistics that can be entered in the genre/form field. WorldCat uses the field “format” and “computer file” as a type that includes “numeric data.”
Example using the HOLLIS (Harvard University library catalog) advanced search: enter (“police shootings” OR “police violence”) as subject AND either (a) one or all of these words in the title: (statistics OR data OR survey OR poll OR dataset), or (b) within Form/Genre: (statistics OR data OR survey OR poll OR dataset). When using Form/Genre, enter “police shootings” or other terms as keywords to expand the search.
Examples of database and data archive thesauri
- ICPSR subject thesaurus
- HASSET Thesaurus - UK Data Service
- ELSST – European Language Social Science Thesaurus - Consortium of European Social Science Data Archives (CESSDA)
Examples of race and ethnicity variables
Datasets and surveys may use different variable names over time for race and ethnicity. Consult the dataset, survey or study documentation.
- IPUMS harmonized race and ethnicity variable
- General Social Survey Library Guide: Locate GSS Topics & Variables. For example, Wkracism asks about discrimination in the workplace. Topics for international comparison include social inequality, national identity, and citizenship.
- Harmonized Latin American Innovation Surveys Database (LAIS): Firm-Level Microdata for the Study of Innovation: Dataset (Inter-American Development Bank)
National and local governments may have standards for race and ethnicity data and categories so that information can be compared across agencies, and these standards may be under ongoing revision. They can be consulted for terms to use in looking up race and ethnicity data. For example, in the United States, there is a proposal to add “Middle Eastern or North African” (MENA) as a new race/ethnicity response category (U.S. Office of Management and Budget Interagency Technical Working Group on Race and Ethnicity Standards).
Resource guide for looking up controlled vocabulary
Libraries develop research guides that may suggest subject headings or terms for searching.
- Racism (Researching Racism): Subject Headings: Racial justice - Racism (Arizona State University)
- Conducting research through an anti-racism lens (University of Michigan)
- Diverse Voice: Collections, Series, and Subject Headings (Princeton)
These guides can be found by entering suggested controlled vocabulary and search terms in Google with the term “libguide(s)”, or by limiting the site to libguides.com or .edu.
Examples of Google search queries:
Suggested controlled vocabulary and search terms
This is a starter list and is neither exhaustive nor comprehensive.
- anti-discrimination, antidiscrimination
- anti-racism, antiracism
- antiracist, anti-racist
- Black lives matter
- criminal justice
- critical race
- discriminate, discrimination
- disproportionate, disproportional, disproportionality
- hate crime
- health disparities, health equity, social determinants of health
- race-based epistemologies
- “racial disparity(ies)”
- racial prejudice
- “racial justice”, “racial equity”
- “social inequality(ies)”
- “social justice”
- Imprisonment, incarceration
- mass incarceration
- police misconduct
- police shootings
- White supremacy, anti-blackness
- ethnic group(s)
- Racialized, racialization, ethnicization
- black or “african american”
- Indians of North America
- [names of particular racial or ethnic groups]. Look up in LCSH demographic/ethnic group headings (PDF)
Law, Policy, Structure
- institutional racism
- structural racism
- structure, structural
- systemic racism
- affirmative action
- Reparations, African Americans-Reparations
- Covenant (law), Real Covenant, Restrictive covenant
- Segregation, Jim Crow
- Civil rights, civil liberties
- Racial profiling, Stop and frisk, Driving while black (brown)
- (Indigenous) Data sovereignty
Data and Methodology
- dataset(s), data set(s)
- Interviews, oral histories, ethnographies, case studies
- statistics, statistical
- indicator(s), community indicator(s)
- Community Participatory Action Research
Notes for Data: The above list is oriented toward numeric data. Other data types include laws and regulations, images and photographs, artworks, contracts and other legal documents, court decisions, articles, news articles, advertisements, maps, and more.
Use ELSST – European Language Social Science Thesaurus to find the above and other related terms in European languages.
Examples of search queries:
For library catalog search queries, refer to the examples above under library catalogs.
Google Search queries using suggested terms with Google operators and limiters :
intitle:(“Mass incarceration” OR imprisonment OR incarceration) (black OR “African Americans”) (dataset OR microdata) site:.gov OR site:.edu
Parentheses are added for clarity but they are ignored by Google.
(ethnic OR race) “latin america” AROUND(5) microdata
race microdata site:.jp, or for India,
race microdata site:.in
(see the list of country code extensions)
Google Dataset Search query (can omit data and site):
race “South Africa” microdata
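The query patterns above can also be generated programmatically when the same search has to be repeated for several country domains. The Python sketch below is one possible way to do that: google_query and search_url are hypothetical helper names, and the https://www.google.com/search?q= URL form is an assumption about the commonly used search address rather than a documented API. Only the intitle: and site: operators shown in the examples above are used.

```python
from urllib.parse import urlencode


def google_query(terms, site=None, intitle=None):
    """Combine search terms with optional intitle: and site: operators."""
    parts = []
    if intitle:
        parts.append(f"intitle:{intitle}")
    parts.append(terms)
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


def search_url(query):
    """Percent-encode the query so it can be pasted into a browser (assumed URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Repeat the same query for the country code domains used in the examples above.
country_domains = {"Japan": ".jp", "India": ".in"}
for country, domain in country_domains.items():
    query = google_query("race microdata", site=domain)
    print(country, "|", query, "|", search_url(query))
```

Because terminology for race and ethnicity differs between countries, the search terms themselves may need to change along with the domain.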
4. Guidance for collecting data
There are many methods to collect or generate data while conducting research as the Principal Investigator of a research project. When datasets are involved, the research is carried out by investigators, statisticians, analysts, and visualization designers. The investigators define the research question and decide on the parameters of the collection, so everything is customized to the investigator’s research question.
Most people will reuse datasets that were not designed with all the variables, time series, or geographic coverage they want, which is why it is very important to read the methodology of every dataset you reuse but did not generate. Pay close attention to how the researcher defines ethnic, racial, and indigenous identities; for an example of how definitions vary, see how OECD countries collect data on ethnic, racial and indigenous identity.
This section gives guidance on collecting datasets generated by others to use in analysis. The question you must start with is: who owns the data you want to use? Data is free to reuse when it is in the public domain. Otherwise, it is licensed by its owners under terms and conditions, through either an open access (free) license or a fee-based (subscription required) license. Before you download or copy a dataset, be very clear about the conditions for reusing that data.
Public domain data has no restrictions on reuse because it is not covered by copyright. In the U.S., works are in the public domain because (1) the copyright has expired, (2) the copyright was not properly renewed, (3) the copyright owner deliberately placed the work in the public domain (Sec. 105), or (4) the work is not of a type that can be protected by copyright (Sec. 102b). Most datasets from the U.S. government are in the public domain. Read the methodology to learn how the data was collected and how frequently. Some intergovernmental organizations (IGOs), countries, and subnational governments also release their data in the public domain, but it is more common to see data released as open access data.
In this guide: Sources and strategies to access public data: Governmental Sources.
Open access data license
You will have to agree to terms and conditions to access and use these data sources, and sometimes pay a nominal amount of money. According to the Open Data Handbook, open data is defined by:
“Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form. Re-use and Redistribution: the data must be provided under terms that permit re-use and redistribution including the intermixing with other datasets. Universal Participation: everyone must be able to use, re-use and redistribute - there should be no discrimination against fields of endeavor or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.”
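The three criteria in this definition can be read as a checklist to apply when reviewing a license. The Python sketch below is only an illustration of that reasoning: LicenseTerms and open_data_problems are hypothetical names, the fields are filled in by hand after reading the actual terms and conditions, and the result is not a legal determination.

```python
from dataclasses import dataclass


@dataclass
class LicenseTerms:
    """Hypothetical hand-filled summary of a data license."""
    available_as_whole: bool
    cost_at_most_reproduction: bool
    modifiable_format: bool
    allows_reuse: bool
    allows_redistribution: bool
    non_commercial_only: bool      # e.g. a "non-commercial" clause
    restricted_to_purposes: bool   # e.g. "education only"


def open_data_problems(terms):
    """Return the names of the handbook criteria the license appears to fail."""
    problems = []
    if not (terms.available_as_whole
            and terms.cost_at_most_reproduction
            and terms.modifiable_format):
        problems.append("availability and access")
    if not (terms.allows_reuse and terms.allows_redistribution):
        problems.append("re-use and redistribution")
    if terms.non_commercial_only or terms.restricted_to_purposes:
        problems.append("universal participation")
    return problems


# A freely downloadable dataset whose license forbids commercial use.
example = LicenseTerms(
    available_as_whole=True,
    cost_at_most_reproduction=True,
    modifiable_format=True,
    allows_reuse=True,
    allows_redistribution=True,
    non_commercial_only=True,
    restricted_to_purposes=False,
)
print(open_data_problems(example))  # ['universal participation']
```

An empty list means none of the three criteria were flagged; the full license text should still be read before reuse.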
Subscription required data license
You will need to agree to terms and conditions and pay money to access and use these data sources. Most of the time, the researcher who oversaw the collection of data will not have any rights because they collected it as a work for hire. Generally, the data is controlled by a publisher or owner, which means you have to negotiate access and the treatment of original, derived, and usage data. Access may be available across an institution that pays for it or restricted to one user by username and password. Even the format might be negotiated: remote access over the internet, or proprietary media with the publisher’s own software to be accessed on one agreed-upon computer. The publisher may require the licensee to maintain the confidentiality of its data, restrict how the data can be used, and even own any new datasets resulting from licensee research or limit the use of any analysis resulting from the data. This is why it is important to read the whole terms and conditions of the contract before you sign or agree to any licensed data.
In this guide: Sources and strategies to access subscription required data: Commercial Databases and Social Networks: Social Media, Community Listservs, Professional Associations
- Eaker, Christopher. (2021) Open Research Toolkit. Retrieved from https://doi.org/10.17605/OSF.IO/A4FTW
- Boettcher, Jennifer and Dames, K. Matthew. “Government Data as Intellectual Property: Is Public Domain the Same as Open Access?” Online Searcher 42, no. 4 (July/August 2018): 42-48.
- From Digital Curation Centre (DCC) - UK. Ball, A. (2014). “How to License Research Data”. DCC How-to Guides.
- From Oregon State University - IP & Licensing Data
- Essays on How Different Countries View Race (IASSIST)
5. Guidance for analyzing data
When analyzing data, the researcher needs to understand that data is not inherently objective, since it is collected, analyzed, and explained by people with biases. In order to interrupt our own biases, we can educate ourselves on the history of traditional statistics and data collection practices, learn how critical race theory can inform our quantitative methods practices, and learn about best practices such as data disaggregation and the CARE Principles.
- Ethics and Best Practices section of this guide
- Tools and information for anti-racism across data-research lifecycle (in .csv format)
- Nautilus Science Magazine: How Eugenics Shaped Statistics
- Minnesota Compass Project: Race data disaggregation: What does it mean? Why does it matter?
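To make the idea of data disaggregation more concrete, the short Python sketch below compares an overall average with averages calculated separately for each group. The records are invented toy numbers with placeholder group labels, not real survey data, and the variable names are chosen only for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented toy records (group label, outcome score); not real survey data.
records = [
    ("group A", 80), ("group A", 82), ("group A", 78), ("group A", 80),
    ("group B", 60), ("group B", 64),
]

# Aggregate view: one number for everyone.
overall = mean(score for _, score in records)

# Disaggregated view: one number per group, plus the group size.
scores_by_group = defaultdict(list)
for group, score in records:
    scores_by_group[group].append(score)

group_means = {group: mean(scores) for group, scores in scores_by_group.items()}
group_sizes = {group: len(scores) for group, scores in scores_by_group.items()}

print("overall mean:", overall)    # 74
print("group means:", group_means)  # group A: 80, group B: 62
print("group sizes:", group_sizes)  # group A: 4, group B: 2
```

In this toy case the overall mean of 74 hides an 18-point gap between the two groups. The group sizes are reported alongside the means because small groups produce less stable estimates and can raise the privacy concerns noted earlier in this guide.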