Review. Nat Rev Genet. 2022 Jul;23(7):429-445. doi: 10.1038/s41576-022-00455-y. Epub 2022 Mar 4.

Sociotechnical safeguards for genomic data privacy

Zhiyu Wan et al. Nat Rev Genet. 2022 Jul.

Abstract

Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. An overview of privacy intrusions and safeguards in genomic data flows.
The four routes of genomic data flow (indicated by the arrow colours) represent four settings in which data are used or shared: health care (red), research (gold), direct-to-consumer (DTC; green) and forensic (dark blue). The grey line represents a combination of the first three settings. In the health-care setting, data collected by a health-care entity (for example, Vanderbilt University Medical Center) are protected by the Genetic Information Nondiscrimination Act of 2008 (GINA) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA) for primary uses. In the research setting, data collected by a research entity (for example, the 1000 Genomes Project, the Electronic Medical Records and Genomics (eMERGE) network or the All of Us Research Program) are protected primarily by the Common Rule for primary uses and by the US National Institutes of Health (NIH) data sharing policy for secondary uses. In the DTC setting, data collected by a DTC entity are protected by the European Union’s General Data Protection Regulation (GDPR) and/or US state privacy laws (for example, the California Consumer Privacy Act, the California Privacy Rights Act or the Virginia Consumer Data Protection Act) for primary uses and by self-regulation (for example, data use agreements, privacy policies or terms of service) for secondary uses. In the forensic setting, data shared with law enforcement are protected by informed consent. A first party is the individual to whom the data correspond, whereas a second party is the organization (or individual) that collects and/or uses the data for a purpose the first party has been made aware of. Third parties, by contrast, are users (or recipients) of data who can communicate only with the second party and might include malicious attackers; examples include researchers who access data from an existing research study or a pharmaceutical company that partners with a DTC genetic testing company. The data flow from a DTC entity to a research entity is represented by the arrow at the bottom. Confidentiality is chiefly at stake when data are being used, whereas anonymity and solitude are chiefly at stake when data are being shared. Specifically, cryptographic tools protect confidentiality against unauthorized access attacks, whereas access control and data perturbation approaches protect anonymity against privacy intrusions such as re-identification and membership inference attacks. For simplicity, the figure omits the effects of the GDPR and data use agreements in the research setting.
Fig. 2. Data perturbation approaches for privacy protection in genomic data sharing.
Each module (or submodule) can work independently to protect data, as shown by the corresponding data flow. In the transformation module, data can be masked, generalized and/or suppressed according to a privacy protection model (for example, k-anonymity). In the aggregation module, data can be aggregated into summary statistics or into the parameters of a machine learning (ML) model. In the synthetic data generation module, a synthetic data set can be generated using a generative adversarial network (GAN). In the obfuscation module, noise can be added to the data according to a privacy protection model (for example, differential privacy). All content in each module (or submodule) is for illustration purposes only. In the example for the generalization submodule, the plus sign represents a generalization of the values one and two for a genomic attribute. In the example for the summary statistics submodule, the minor allele frequency for each single-nucleotide polymorphism (SNP) marker is computed for each group of individual records (n represents the number of records in the group; x_i represents the value of a genomic attribute for the i-th record in the group, which in this example is the number of minor alleles at a SNP position). In the example for the ML models submodule, the three-layer neural network has 21 parameters (that is, 16 weights and 5 biases) that need to be learned. In the example for the GAN submodule, X represents the input data set, G represents the generator network and D represents the discriminator network. In the example for the reconstruction attack in the risk assessment module, the attacker tries to reconstruct the original data set by linkage and inference, and the data sharer assesses the privacy risk using a distance function. In the example for the membership inference attack in the risk assessment module, the attacker tries to infer the membership of each targeted individual by hypothesis testing, and the data sharer assesses the privacy risk using a function that measures the test’s accuracy. The reconstruction attack and the membership inference attack are used here for illustration only and could be replaced with any other attack (for example, a re-identification attack or a familial search attack) or an arbitrary combination of attacks. Data can be protected sequentially by multiple modules and submodules until the privacy risk is mitigated to an acceptable level, at which point the data are released. r represents the privacy risk; d represents the distance function; f represents the function that measures accuracy; θ represents the threshold for the privacy risk.
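To make the aggregation and obfuscation modules concrete, the following is a minimal sketch, not taken from the Review, of computing a per-group MAF (the summary statistics submodule) and releasing it under differential privacy with the Laplace mechanism (the obfuscation submodule). The function names, the toy genotype encoding (a minor-allele count of 0, 1 or 2 per diploid record) and the parameter choices are our own assumptions for illustration.

import math
import random

def maf(counts):
    """Exact minor allele frequency for one SNP.

    counts: per-record minor-allele counts x_i in {0, 1, 2}
    (diploid), so the frequency is sum(x_i) / (2 * n).
    """
    return sum(counts) / (2 * len(counts))

def laplace(scale):
    """Sample Laplace(0, scale) noise by inverse-CDF transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_maf(counts, epsilon):
    """Release an epsilon-differentially private MAF estimate.

    Changing one record alters the allele-count sum by at most 2,
    so the Laplace mechanism uses scale = 2 / epsilon.
    """
    noisy_sum = sum(counts) + laplace(2.0 / epsilon)
    n = len(counts)
    # Clamp to the valid frequency range after noise addition.
    return min(max(noisy_sum / (2 * n), 0.0), 1.0)

if __name__ == "__main__":
    random.seed(7)
    group = [random.choice([0, 1, 2]) for _ in range(1000)]  # toy genotypes
    print(f"exact MAF: {maf(group):.4f}")
    print(f"private MAF (eps=1): {dp_maf(group, 1.0):.4f}")

As in the figure, the noisy release would then pass through the risk assessment module, and a smaller epsilon trades accuracy for a lower privacy risk r.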
Fig. 3. Cryptographic approaches for privacy protection in the use of genomic data.
a | Homomorphic encryption enables a third party to compute on encrypted data without decrypting any specific record. In this instance, it is applied to a genome-wide association study and a disease susceptibility test. b | Secure multiparty computation enables multiple parties to jointly compute a function of their inputs without revealing those inputs. Here, three institutions share encrypted data with third parties to compute summary statistics (for example, the minor allele frequency (MAF)). c | A trusted execution environment, such as Intel Software Guard Extensions (SGX), isolates the computation in an encrypted enclave with central processing unit (CPU) support, so that even malicious operating system software cannot see the enclave contents. Here, an institution computes summary statistics (for example, the MAF) in a secure enclave hosted by a third party. d | A blockchain enables encrypted, immutable records to be stored on a decentralized network. Here, the individual manages the decryption key using a blockchain while sharing encrypted data with researchers. Avg., average; RAM, random-access memory; SNP, single-nucleotide polymorphism.
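As a companion sketch for panel b, the following illustrates additive secret sharing, one standard building block of secure multiparty computation: three institutions jointly compute a pooled MAF without any institution revealing its local allele count. This is our own minimal illustration rather than the protocol used in the Review; the modulus, counts and cohort sizes are hypothetical, and a production protocol would also need secure channels and protection against malicious parties.

import random

PRIME = 2**61 - 1  # field modulus, comfortably larger than any count

def share(secret, n_parties):
    """Split an integer into n additive shares modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_total(local_counts):
    """Sum each party's private count without revealing any of them.

    Each party secret-shares its count, every party locally sums the
    shares it receives, and only these partial sums are combined;
    the shares themselves look uniformly random.
    """
    n = len(local_counts)
    # all_shares[i][j] = share j produced by party i
    all_shares = [share(c, n) for c in local_counts]
    # Party j aggregates the j-th share from every party.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    return sum(partial_sums) % PRIME

if __name__ == "__main__":
    # Hypothetical per-institution minor-allele counts and cohort sizes.
    counts = [412, 388, 655]
    cohort_sizes = [1000, 900, 1500]
    total = secure_total(counts)
    pooled_maf = total / (2 * sum(cohort_sizes))
    print(f"joint MAF: {pooled_maf:.4f} (no single count was disclosed)")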
