Nat Commun. 2025 Jan 18;16(1):825.
doi: 10.1038/s41467-025-56077-5.

Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data

Caitlin Guccione et al. Nat Commun.

Abstract

As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.

Conflict of interest statement

Competing interests: D.M. is a consultant for BiomeSense, Inc., has equity, and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. G.H. is the recipient of the Robert A. Winn Diversity in Clinical Trials: Career Development Award, which is partly funded by the Bristol Myers Squibb Foundation. B.L. is the owner of InOrder Labs LLC. K.C. has research grant support from Phathom Pharmaceuticals. R.K. is a scientific advisory board member and consultant for BiomeSense, Inc., has equity, and receives income. He is a scientific advisory board member of and has equity in GenCirq. He is a consultant for DayTwo and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a co-founder of Micronoma, has equity, and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. The remaining authors declare no competing interests.

Figures

Fig. 1. Sex biases identified in inadequately host-filtered human tumor tissue data.
a RPCA of microbial relative abundances quantified from tumor samples in the Hartwig Medical Foundation Database, which were originally host-filtered against GRCh38.p7 only. Differences between male and female groups were statistically significant (PERMANOVA; pseudo-F = 65.4, p = 0.00025). b The same dataset and pre-processing steps as in a, but with the T2T-CHM13v2.0 reference genome added to host filtration. Differences between male and female groups were not statistically significant (PERMANOVA; pseudo-F = 1.23, p = 0.29).
Fig. 2. Host filtration pipeline and runtime evaluation.
a Pipeline of host filtration methods. b Using simulated data with a 50/50 mix of human data from HPRC and microbial data from FDA-ARGOS, we applied the three host filtration methods at three different sample sizes. Runtimes were averaged across 10 runs per sample size. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
Fig. 3. Host filtration pipeline simulated data validation.
Using the 10 simulated datasets of 1 million reads described in Fig. 2b, we calculated a the number of human reads remaining and b the number of microbial reads remaining for host filtration Methods 1–3 (HPRC host filtration was performed excluding the 10 genomes used for data simulation). HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
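The box-plot convention used in these captions (quartiles, IQR, and whisker limits at Q1 − 1.5 × IQR and Q3 + 1.5 × IQR) can be illustrated with a minimal Python sketch. The function name `box_plot_summary`, the example read counts, and the choice of the "inclusive" quartile method are assumptions made for illustration; the quartile convention used for the published figures is not stated here, and plotting libraries differ on it.

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles, IQR, and 1.5 x IQR whisker limits for a sample.

    Hypothetical helper: the 'inclusive' quartile method is an assumption,
    not necessarily the convention used for the published figures.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": q1 - 1.5 * iqr,  # Q1 - 1.5 x IQR
        "upper_limit": q3 + 1.5 * iqr,  # Q3 + 1.5 x IQR
    }


# Made-up numbers of reads remaining after host filtration (not from the paper).
example_counts = [1, 2, 3, 4, 5]
summary = box_plot_summary(example_counts)
print(summary)
```

Observations outside `lower_limit` and `upper_limit` would fall beyond the whisker range described in the caption.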
Fig. 4. Comparing human exome and tumor tissue samples across host filtration methods.
a The number of reads remaining after host-filtering 30 human exomes, each subset to 1 million reads, across methods. b 100 metastatic colorectal cancer tissue samples were selected from HMF, and read counts were calculated after applying the improved host filtration methods. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
Fig. 5. Comparing human skin and fecal samples across host filtration methods.
a 87 human skin samples were host-filtered with the improved methods, and we then calculated the percentage of reads remaining. b We calculated the percentage of reads remaining on a per-sample basis for each of the 50 human fecal samples examined. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
Fig. 6. Re-identification from a set of genotype data based on the human reads in fecal samples prevented with improved host filtration.
The 343 fecal samples from Tomofuji et al. Nature Microbiology 2023, with paired genotype data, were re-analyzed with various combinations of updated host filtration methods (GRCh38.p14, T2T-CHM13v2.0, Human Pangenome Reference Consortium 2024 release), resolving host data leakage. The x-axis of each plot indicates the number of bases used to calculate the likelihood scores. The y-axis indicates the two-sided P values calculated using a standard normal distribution based on the standardized likelihood scores. The red and blue dashed lines indicate p = 4.3 × 10^-7 (0.05/117,649 tests) and p = 1.5 × 10^-4 (0.05/343 tests), respectively. The results of the 117,649 tests (343 genotype data × 343 metagenome data) are indicated by the colors of the points. Some samples could not be used for the re-identification analysis because too few reads remained after filtering, hence the fewer points shown for some host filtration methods. A full description of the calculation of P values can be found in the Methods.
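The two significance thresholds in this caption are the 0.05 level divided by the number of tests, and the P values are two-sided tail probabilities of a standard normal distribution. The Python sketch below reproduces only that arithmetic, under stated assumptions: the function name `two_sided_p_value` and the example score are hypothetical, and the calculation of the standardized likelihood scores themselves is described in the paper's Methods and is not reproduced here.

```python
import math


def two_sided_p_value(z_score):
    """Two-sided P value of a standardized score under a standard normal.

    Uses P = erfc(|z| / sqrt(2)), which equals 2 * (1 - Phi(|z|)).
    """
    return math.erfc(abs(z_score) / math.sqrt(2.0))


n_samples = 343                    # fecal samples with paired genotype data
n_pairs = n_samples * n_samples    # 343 genotype x 343 metagenome = 117,649 tests
alpha = 0.05

# Thresholds as 0.05 divided by the number of tests.
threshold_all_pairs = alpha / n_pairs      # about 4.25e-7 (caption: 4.3 x 10^-7)
threshold_per_sample = alpha / n_samples   # about 1.46e-4 (caption: 1.5 x 10^-4)

# Hypothetical standardized likelihood score, for illustration only.
example_z = 6.0
example_p = two_sided_p_value(example_z)
print(n_pairs, threshold_all_pairs, threshold_per_sample)
print(example_p, example_p < threshold_all_pairs)
```

In this sketch, a genotype and metagenome pair would be flagged as a putative match only when its P value falls below the chosen threshold.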
