When to use the k-rule? Managing the risk of de-anonymisation in survey data
Ethical and legal considerations require anonymisation to protect respondents’ privacy when sharing survey data. When anonymising data, it is often not sufficient to remove direct identifiers such as names, contact details, and IP addresses. Indirect identifiers also need to be considered. In combination, indirect identifiers can be used to re-identify respondents, for example a ZIP code combined with an exceptionally high income. Social science survey data pose particular challenges for anonymisation: the demographic information they include is often very detailed, which increases the re-identification risk.
To manage this risk, social science data archives have processes in place to anonymise data or restrict data access. One strategy, k-anonymity, may help to protect respondents of certain surveys, but it is rarely discussed in the context of anonymising social science microdata. A dataset is k-anonymous “for k > 1 if, for each combination of key attributes, at least k records exist in the data set sharing that combination” (Domingo-Ferrer and Torra 2008, p. 991).
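As a minimal illustration of the quoted definition, the following Python sketch counts how many records share each observed combination of key attributes and reports the smallest group size. The function names, attribute names, and toy records are hypothetical and are not taken from the surveys discussed in this contribution.

```python
from collections import Counter


def group_sizes(records, key_attributes):
    """Count the records sharing each observed combination of key attributes."""
    return Counter(tuple(r[a] for a in key_attributes) for r in records)


def k_level(records, key_attributes):
    """Return the size of the smallest group, i.e. the largest k the data satisfy."""
    sizes = group_sizes(records, key_attributes)
    return min(sizes.values()) if sizes else 0


def is_k_anonymous(records, key_attributes, k):
    """Check the definition: every observed combination occurs at least k times."""
    return k_level(records, key_attributes) >= k


# Hypothetical toy records; the attribute names are assumptions for illustration.
records = [
    {"zip": "68159", "age_group": "30-39", "income_band": "mid"},
    {"zip": "68159", "age_group": "30-39", "income_band": "mid"},
    {"zip": "68159", "age_group": "40-49", "income_band": "high"},
    {"zip": "68161", "age_group": "40-49", "income_band": "high"},
    {"zip": "68161", "age_group": "40-49", "income_band": "high"},
]

full_keys = ["zip", "age_group", "income_band"]
coarse_keys = ["age_group", "income_band"]

print(k_level(records, full_keys))    # smallest group with all three keys
print(k_level(records, coarse_keys))  # smallest group after dropping the ZIP code
```

In this toy example one record is unique on the full set of key attributes, so the data are not 2-anonymous; dropping the ZIP code from the key attributes removes that unique combination. Which attributes count as key attributes is itself a judgment that depends on what an intruder could plausibly know.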
Unique individuals in survey data are usually not considered problematic. Among other factors, sampling makes it nearly impossible to rule out that data twins exist in the population, so a record that is unique in the sample cannot be attributed to one person with certainty. In this contribution we look at conditions under which this protection through sampling is violated or weak, and we identify criteria that make the application of k-anonymity necessary. To do so, we analyse different risk components within a risk assessment framework for two cases: a survey whose exact sample can be replicated, and a survey that can potentially include public figures. We compare the risk factors of both surveys with those of a large population survey in which protection through sampling is present.