IASSIST 2025: IASSIST at 50! Bridging oceans, harbouring data & anchoring the future


Synthetic data fidelity: how less can be more

Synthetic data is generated rather than observed and includes anything from values invented on the spot and random numbers produced by simple code to predictions from complex machine learning models, the output of sophisticated digital twin simulations and much more. An important related concept, fidelity, captures how “faithful” a synthetic data set is to its real-world counterpart. As such, fidelity is often seen as an important, if not the most important, feature of synthetic data.
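
As a purely illustrative sketch of two of the generation approaches mentioned above, the Python snippet below contrasts random numbers produced by simple code, which reference no real data at all, with values drawn from a distribution fitted to a simulated stand-in for a real-world sample. All variable names and parameters here are invented for illustration.

# Illustrative sketch only: two toy ways of producing synthetic data.
# All names and parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=42)

# 1. Low-effort synthetic data: random numbers from simple code,
#    with no reference to any real-world data set.
random_incomes = rng.uniform(low=0, high=100_000, size=1_000)

# 2. Model-based synthetic data: fit a simple distribution to a
#    (hypothetical) real-world sample, then draw new values from it.
real_incomes = rng.lognormal(mean=10, sigma=0.5, size=1_000)  # stand-in for real data
fitted_mean = np.log(real_incomes).mean()
fitted_sigma = np.log(real_incomes).std()
synthetic_incomes = rng.lognormal(mean=fitted_mean, sigma=fitted_sigma, size=1_000)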

Yet fidelity is not binary; a synthetic data set can be very faithful in some ways while wildly unfaithful in others, and the specifics of its fidelity determine its usefulness. For example, if synthetic data is intended to fix gaps or biases in real-world data sets, then it must be deliberately unfaithful to the original in at least some specific ways. At the same time, not all synthetic data sets try to mimic, replicate or augment existing real-world data sets, and some do not use any real-world data in the generation process at all. As such, fidelity (and especially high fidelity) is not always as important as might be assumed. This talk introduces and defines what synthetic data is and is not, examines the role of fidelity, and highlights common use cases, generation methods and concerns around synthetic data at varying levels of fidelity. When used appropriately to link a method, data set and research question, synthetic data can provide a valuable alternative to real-world data in situations where real-world data is unavailable, restricted, or unknown. Importantly, synthetic data is especially useful for enhancing reproducibility and transparency in research by balancing data utility against privacy protection, as well as for facilitating hypothesis testing and method development.
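
To make the idea that fidelity is not binary concrete, the toy Python sketch below (illustrative only; all data and names are invented) builds synthetic data by shuffling each column of a simulated data set independently: the marginal distributions are reproduced exactly, while the correlation between the variables is lost, so the result is faithful in one respect and unfaithful in another.

# Illustrative sketch only: a toy check showing that fidelity is not
# a single yes/no property. All data and names are invented.
import numpy as np
from scipy.stats import ks_2samp, pearsonr

rng = np.random.default_rng(seed=0)

# Hypothetical "real" data: two positively correlated variables.
age = rng.normal(45, 12, size=2_000)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=2_000)

# Naive synthetic data: shuffle each column independently.
synthetic_age = rng.permutation(age)
synthetic_income = rng.permutation(income)

# Marginal fidelity: distributions match (shuffling preserves them exactly).
print(ks_2samp(age, synthetic_age).pvalue)        # high p-value
print(ks_2samp(income, synthetic_income).pvalue)  # high p-value

# Joint fidelity: the age-income relationship is destroyed.
print(pearsonr(age, income)[0])                      # strongly positive
print(pearsonr(synthetic_age, synthetic_income)[0])  # near zero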

Jools Kasmire
UK Data Service / University of Manchester
United Kingdom