Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
- PMID: 39827261
- PMCID: PMC11742726
- DOI: 10.1038/s41467-025-56077-5
Abstract
As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.
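To make the flow-through problem concrete, the following minimal sketch shows how reads from a host region that is absent from the reference survive filtration. It is an illustration only and is not one of the three methods benchmarked in the paper: the k-mer matching approach, the function names (`kmers`, `filter_host_reads`), the parameters `k` and `min_fraction`, and all sequences are assumptions made for this example. The toy sequences use restricted alphabets so that no k-mers overlap by chance.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_host_reads(reads, host_reference, k=8, min_fraction=0.5):
    """Split reads into (kept, removed) lists.

    A read is removed as host-derived when at least `min_fraction` of its
    k-mers also occur in `host_reference`. This is a hypothetical toy
    classifier, not a production aligner.
    """
    host_kmers = kmers(host_reference, k)
    kept, removed = [], []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            # Read is shorter than k, so it cannot be classified; keep it.
            kept.append(read)
            continue
        shared = len(read_kmers & host_kmers) / len(read_kmers)
        if shared >= min_fraction:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


# Synthetic host genome made of two regions (hypothetical sequences).
region_a = "ACCACAACCCAACACCACAAC"  # present in both references
region_b = "TGGTGTTGGGTTGTGGTGTTG"  # missing from the incomplete reference

incomplete_reference = region_a
complete_reference = region_a + region_b

host_read_a = region_a[2:18]   # host read from the covered region
host_read_b = region_b[2:18]   # host read from the missing region
microbial_read = "AGGAGAAGGGAAGAGGAGAAG"  # non-host read

reads = [host_read_a, host_read_b, microbial_read]

kept_incomplete, removed_incomplete = filter_host_reads(reads, incomplete_reference)
kept_complete, removed_complete = filter_host_reads(reads, complete_reference)

print("Incomplete reference keeps:", kept_incomplete)
print("Complete reference keeps:  ", kept_complete)
```

With the incomplete reference, `host_read_b` passes through alongside the microbial read; with the complete reference, only the microbial read remains. In real data, such residual host reads are the material that could skew downstream results or carry identifying information.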
© 2025. The Author(s).
Conflict of interest statement
Competing interests: D.M. is a consultant for BiomeSense, Inc., has equity, and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. G.H. is the recipient of the Robert A. Winn Diversity in Clinical Trials: Career Development Award, which is partly funded by the Bristol Myers Squibb Foundation. B.L. is the owner of InOrder Labs LLC. K.C. has research grant support from Phathom Pharmaceuticals. R.K. is a scientific advisory board member of and consultant for BiomeSense, Inc., has equity, and receives income. He is a scientific advisory board member of and has equity in GenCirq. He is a consultant for DayTwo and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a co-founder of Micronoma, has equity, and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. The remaining authors declare no competing interests.
Update of
- Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data. Res Sq [Preprint]. 2024 Oct 23:rs.3.rs-4721159. doi: 10.21203/rs.3.rs-4721159/v1. Update in: Nat Commun. 2025 Jan 18;16(1):825. doi: 10.1038/s41467-025-56077-5. PMID: 39502785.
Grants and funding
- R01 CA241728/CA/NCI NIH HHS/United States
- DP1 AT010885/AT/NCCIH NIH HHS/United States
- R01 CA270235/CA/NCI NIH HHS/United States
- AGA Research Scholar Award AGA2022-13-05/AGA Research Foundation
- NIH/NIGMS T32GM007198/U.S. Department of Health & Human Services | National Institutes of Health (NIH)
- R21 HG013433/HG/NHGRI NIH HHS/United States
- T32 GM007198/GM/NIGMS NIH HHS/United States
- CDC award 75D301-22-C-14717/U.S. Department of Health & Human Services | Centers for Disease Control and Prevention (CDC)
- NIH Pioneer DP1AT010885/U.S. Department of Health & Human Services | National Institutes of Health (NIH)
- U19 AG063744/AG/NIA NIH HHS/United States
- NCI U24CA248454/U.S. Department of Health & Human Services | NIH | National Cancer Institute (NCI)
- P30 DK120515/DK/NIDDK NIH HHS/United States
- P30 CA023100/CA/NCI NIH HHS/United States
- U24 CA248454/CA/NCI NIH HHS/United States
