Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Apr 13:13:876686.
doi: 10.3389/fgene.2022.876686. eCollection 2022.

A Sequence Obfuscation Method for Protecting Personal Genomic Privacy

Affiliations

A Sequence Obfuscation Method for Protecting Personal Genomic Privacy

Shibiao Wan et al. Front Genet. .

Abstract

With the technological advances in recent decades, determining whole genome sequencing of a person has become feasible and affordable. As a result, large-scale individual genomic sequences are produced and collected for genetic medical diagnoses and cancer drug discovery, which, however, simultaneously poses serious challenges to the protection of personal genomic privacy. It is highly urgent to develop methods which make the personal genomic data both utilizable and confidential. Existing genomic privacy-protection methods are either time-consuming for encryption or with low accuracy of data recovery. To tackle these problems, this paper proposes a sequence similarity-based obfuscation method, namely IterMegaBLAST, for fast and reliable protection of personal genomic privacy. Specifically, given a randomly selected sequence from a dataset of genomic sequences, we first use MegaBLAST to find its most similar sequence from the dataset. These two aligned sequences form a cluster, for which an obfuscated sequence was generated via a DNA generalization lattice scheme. These procedures are iteratively performed until all of the sequences in the dataset are clustered and their obfuscated sequences are generated. Experimental results on benchmark datasets demonstrate that under the same degree of anonymity, IterMegaBLAST significantly outperforms existing state-of-the-art approaches in terms of both utility accuracy and time complexity.

Keywords: DNA generalization lattice; IterMegaBLAST; MegaBLAST; clustering; genomic privacy; machine learning; obfuscation methods; sequence similarity.

PubMed Disclaimer

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
FIGURE 1
Current status of personal genomic data for utilization and privacy. (A) The whole genome sequencing (WGS) cost decreased significantly with the technological advances in recent decades. (B) The number of COVID-19 cases increased significantly in these 2 years and concurrently the number of personal genomic data would increase. (C) Large-scale projects have been launched for betterment of human healthcare while simultaneously posing serious challenges on protecting individual genomic privacy. “All of Us” (https://allofus.nih.gov/) was launched by US and “1 + Million Genomes” (https://digital-strategy.ec.europa.eu/en/policies/1-million-genomes) was initialized by the European Union. Blue circles represent good benefits of genomic data utilization or privacy well proteted, whereas the red circle represents the challenge of genomic data privacy breached.
FIGURE 2
FIGURE 2
The generalization hierarchy (Malin, 2005b) for sequence or nucleotide obfuscation. Note that lev is the level of corresponding nucleotides and the symbol “-” represents the gap.
FIGURE 3
FIGURE 3
Statistics of the two datasets used in this paper. (A) and (B): The density distribution of the sequence lengths for Dataset I (A) and Dataset II (B). (C,D): Distributions of the percentages of each nucleotide (i.e., A, C, G, and T) for Dataset I (C) and Dataset II (D). The numbers of sequences for Dataset I and Dataset II are 56 and 372, respectively.
FIGURE 4
FIGURE 4
The average distances of IterMegaBLAST varying with respect to the number of DNA sequences for (A) Dataset I and (B) Dataset II. The shorter the distance is, the less the information loss. DNALA is from (Malin, 2005b), while MWM, Hybrid and Online algorithms are from (Li et al., 2012). IterMegaBLAST is the method proposed in this paper.
FIGURE 5
FIGURE 5
An example of how IterMegaBLAST works. IterMegaBLAST consists of two major steps: sequence alignment and clustering, and sequence obfuscation. Given a query sequence, e.g., LN∣AF392171∣GI∣18029617, IterMegaBLAST first uses MegaBLAST to find its top homolog, e.g., LN∣AF392284∣GI∣18029730, which form a cluster. Then, IterMegaBLAST uses the sequence obfuscation method introduced in Section 2.4 to generate the generalized sequence (see the “Sequence info” box on the right panel) for this cluster. The distance is calculated and the related meta information is produced (see the “Meta info” box on the right panel). The red circles indicate the mismatched nucleotides between the query sequence and the homolog, and the blue circles represent the generalized nucleotides for the mismatches nucleotides.

References

    1. Al M., Wan S., Kung S.-Y. (2017). Ratio Utility and Cost Analysis for Privacy Preserving Subspace Projection. arXiv preprint arXiv:1702.07976
    1. Altschul S., Madden T. L., Schaffer A. A., Zhang J., Zhang Z., Miller W., et al. (1997). Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 25, 3389–3402. 10.1093/nar/25.17.3389 - DOI - PMC - PubMed
    1. Carpov S., Gama N., Georgieva M., Jetchev D. (2021). Genoppml–a Framework for Genomic Privacy-Preserving Machine Learning. Cryptology ePrint Archive.
    1. Chen J., Wang W. H., Shi X. (2020). “Differential Privacy protection against Membership Inference Attack on Machine Learning for Genomic Data,” in BIOCOMPUTING 2021: Proceedings of the Pacific Symposium (World Scientific; ), 26–37. 10.1142/9789811232701_0003 - DOI - PubMed
    1. Chute C. G., Kohane I. S. (2013). Genomic Medicine, Health Information Technology, and Patient Care. JAMA 309, 1467–1468. 10.1001/jama.2013.1414 - DOI - PMC - PubMed