[Preprint]. 2024 Oct 23:rs.3.rs-4721159.
doi: 10.21203/rs.3.rs-4721159/v1.

Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data


Caitlin Guccione et al. Res Sq.

Abstract

As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.

Figures

Figure 1. Sex biases found in inadequately filtered human tumor tissue data
(a) RPCA-PCoA of the original microbial abundance information from tumor samples in HMF, generated with GRCh38.p7 filtration only. Differences between male and female groups were statistically significant. (b) The same dataset and pre-processing steps as in (a), with the T2T-CHM13v2.0 reference genome added to host filtration. Differences between male and female groups were not statistically significant.
Figure 2. Host filtration pipeline and runtime evaluation
(a) Pipeline of host filtration methods. (b) Using simulated data with a 50/50 mix of human data from HPRC and microbial data from FDA-ARGOS, we ran the 3 host filtration methods at 3 different sample sizes. Runtimes were averaged across 10 runs per sample size. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
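
As a rough illustration of filtering against several references in turn, the sketch below drops a read as soon as it shares an exact k-mer with any supplied reference and records how many reads remain after each stage. This is a toy stand-in, not the authors' pipeline: the exact k-mer matching, the `filter_reads` and `kmers` names, the value of `k`, and the short example sequences are all assumptions made for illustration, and real host filtration relies on alignment or indexed mapping against full reference genomes.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(
    reads: List[str], references: Dict[str, str], k: int = 8
) -> Tuple[List[str], Dict[str, int]]:
    """Toy sequential host filter (hypothetical helper, not from the paper).

    Reads are screened against each reference in insertion order. A read is
    removed if it shares at least one exact k-mer with the current reference.
    Returns the surviving reads and the number remaining after each stage.
    """
    remaining = list(reads)
    remaining_after: Dict[str, int] = {}
    for name, ref_seq in references.items():
        index = kmers(ref_seq, k)
        remaining = [r for r in remaining if not (kmers(r, k) & index)]
        remaining_after[name] = len(remaining)
    return remaining, remaining_after


# Made-up sequences standing in for two host references and three reads.
toy_references = {
    "refA": "ACGTACGTTTGACCA",
    "refB": "GGGCCCAAATTTGGG",
}
toy_reads = ["ACGTACGT", "CCCAAATT", "TTTTTTTT"]

kept, counts = filter_reads(toy_reads, toy_references)
print(kept)
print(counts)
```

In this toy run the second read passes the first reference and is removed only by the second one, which mirrors the idea that adding references can catch host reads an earlier reference missed.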
Figure 3. Host filtration pipeline simulated data validation
Using the 10 simulated datasets of 1 million reads described in Figure 2b, we calculated (a) the number of human reads remaining and (b) the number of microbial reads remaining for host filtration Methods 1-3 (HPRC host filtration excluded the 10 pangenomes used for simulation). HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
Figure 4. Comparing human exome and tumor tissue samples across host filtration methods
(a) The number of reads remaining after host read filtering of 30 human exomes, each subset to 1 million reads, across methods. (b) 100 metastatic colorectal cancer tissue samples were selected from HMF, and read counts were calculated after applying the updated host filtration methods. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
Figure 5. Comparing human skin and fecal samples across host filtration methods
(a) 87 human skin samples were host filtered with the updated methods, and the percentage of reads remaining was then calculated. (b) The percentage of reads remaining was calculated on a per-sample basis for each of the 50 human fecal samples examined. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
Figure 6. Re-identification from a set of genotype data based on the human reads in fecal samples prevented with proper host filtration
The 343 fecal samples from Tomofuji et al. Nature Microbiology 2023, with paired genotype data, were re-analyzed with various combinations of updated host filtration methods (GRCh38.p14, T2T-CHM13v2.0, Human Pangenome Reference Consortium 2024 release), resolving host data leakage. The x-axis of each plot indicates the number of bases used to calculate the likelihood scores. The y-axis indicates the P-values. The red and blue dashed lines indicate P = 4.25 × 10^-7 (0.05/117,649 tests) and P = 1.46 × 10^-4 (0.05/343 tests), respectively. The results of the 117,649 tests (343 genotype data × 343 metagenome data) are indicated by the colors of the points. Some samples could not be used for the re-identification analysis because too few reads remained after filtering, so fewer dots are shown for some host filtration methods. A full description of the calculation of P-values can be found in the Methods.
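
The two dashed-line thresholds in Figure 6 follow from dividing a significance level of 0.05 by the number of tests, which is the standard Bonferroni correction. The short check below reproduces that arithmetic; the function and variable names are illustrative assumptions and do not come from the paper.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


alpha = 0.05          # significance level implied by the caption (0.05 / n tests)
n_genotypes = 343     # genotype datasets
n_metagenomes = 343   # metagenome datasets

# Every genotype is compared against every metagenome.
all_pairs = n_genotypes * n_metagenomes

threshold_all_pairs = bonferroni_threshold(alpha, all_pairs)
threshold_per_sample = bonferroni_threshold(alpha, n_metagenomes)

print(f"tests: {all_pairs}")
print(f"all-pairs threshold:  {threshold_all_pairs:.2e}")
print(f"per-sample threshold: {threshold_per_sample:.2e}")
```

The stricter threshold corrects for all genotype-by-metagenome comparisons, while the looser one corrects only for the 343 comparisons involving a single dataset.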

