[Preprint]. 2023 Jul 31:2023.07.28.550993.
doi: 10.1101/2023.07.28.550993.

Major data analysis errors invalidate cancer microbiome findings

Abraham Gihawi et al. bioRxiv.

Abstract

We re-analyzed the data from a recent large-scale study that reported strong correlations between microbial organisms and 33 different cancer types, and that created machine learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (1) errors in the genome database and the associated computational methods led to millions of false positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (2) errors in transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.
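The second flaw can be hard to picture: how can a microbe with zero reads in every sample carry a signal? The sketch below shows one way a transformation can produce that effect. It is not a reconstruction of the original study's pipeline. It assumes a voom-style log2 counts-per-million transform with a 0.5 pseudocount, and the library sizes and tumor type names are hypothetical. Because the transformed value of a zero count depends on the sample's library size, any systematic depth difference between tumor types turns all-zero raw data into type-specific values.

```python
import math
import random


def log_cpm(count, library_size):
    """voom-style transform: log2 counts-per-million with a 0.5 pseudocount."""
    return math.log2((count + 0.5) / (library_size + 1.0) * 1e6)


random.seed(0)

# Hypothetical library sizes (assumption): the two tumor types were
# sequenced at systematically different depths.
type_a_libs = [random.randint(900_000, 1_100_000) for _ in range(50)]
type_b_libs = [random.randint(4_500_000, 5_500_000) for _ in range(50)]

# The raw count for this genus is zero in every single sample.
type_a_vals = [log_cpm(0, n) for n in type_a_libs]
type_b_vals = [log_cpm(0, n) for n in type_b_libs]

# Shallower libraries give larger transformed values, so one threshold
# placed between the two groups acts as a "classifier".
threshold = (min(type_a_vals) + max(type_b_vals)) / 2
correct = sum(v > threshold for v in type_a_vals) + sum(
    v <= threshold for v in type_b_vals
)
accuracy = correct / (len(type_a_vals) + len(type_b_vals))

print(f"type A range: {min(type_a_vals):.2f} to {max(type_a_vals):.2f}")
print(f"type B range: {min(type_b_vals):.2f} to {max(type_b_vals):.2f}")
print(f"accuracy of a single threshold on all-zero raw data: {accuracy:.2f}")
```

In this toy setting the separation comes entirely from sequencing depth, not from any microbial read, which is the kind of artificial signature a downstream machine learning model can exploit.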

Conflict of interest statement

Conflicts of interest. CSC, DSB and AG are coinventors on a patent application (UK Patent Application No. 2200682.9) from the University of East Anglia/UEA Enterprises Limited regarding the application of biomarker bacterial genera in prostate cancer. All other authors declare no conflicts.

Figures

Figure 1.
Average number of reads per sample in bladder cancer (BLCA) for the 20 most abundant genera reported in Poore et al. (left), averaged across 156 whole-genome sequencing samples. The right panel shows the counts for the same samples and the same genera, in the same order, as computed in our re-analysis. Note that the y-axis scales differ by a factor of 2000. The x-axis shows genus names.
Figure 2.
Distribution of normalized counts for Hepandensovirus in adrenocortical carcinoma (blue) versus all other samples (orange). Inset shows a zoomed-in view of the distribution for the smallest values. All raw values were zero.
Figure 3.
Distribution of normalized counts for Thiorhodospira reads in kidney chromophobe (KICH) cancer (blue) and normal (orange) samples. Nearly all raw values were zero except for 7 samples with a raw count of 1.
Figure 4.
Distribution of normalized read counts in the APCR data set for Nitrospira reads found in lung squamous cell carcinoma (blue) and all other cancer types (orange). For clarity, the y-axis is truncated at 500, but the peak of the distribution for other cancers (orange) is at 1389.
Figure 5.
Distribution of normalized counts for Mulikevirus reads in head and neck squamous cell (HNSC) cancer (orange) and normal (blue) samples. All raw values were zero.
Figure 6.
Accuracies for one-vs-all tumor classification models obtained from a selection of samples and genera with zero classified reads prior to normalization. Each row shows the accuracies of a classifier that distinguished one cancer type from all other cancer types in the table. AUC: maximum measured area under the sensitivity-specificity curve. PPV: positive predictive value. NPV: negative predictive value.
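For readers less familiar with the metrics named in the Figure 6 caption, the following minimal sketch computes PPV and NPV for a single one-vs-all classifier. The labels and predictions are hypothetical and are not taken from the study; `predictive_values` is an illustrative helper, not a function from any library.

```python
def predictive_values(y_true, y_pred):
    """Return (PPV, NPV) for binary labels, where 1 marks the target cancer type."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # PPV: fraction of predicted positives that are truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # NPV: fraction of predicted negatives that are truly negative.
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical example: 3 samples of the target type, 5 of all other types.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
ppv, npv = predictive_values(y_true, y_pred)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

In a one-vs-all setting the negative class is usually much larger than the positive class, so NPV can look high even when PPV is modest; both are worth reading together.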

References

    1. Bosch FX, Lorincz A, Munoz N, Meijer CJ, Shah KV. 2002. The causal relation between human papillomavirus and cervical cancer. J Clin Pathol 55:244–65.
    2. Warren JR, Marshall B. 1983. Unidentified curved bacilli on gastric epithelium in active chronic gastritis. Lancet 1:1273–5.
    3. Castellarin M, Warren RL, Freeman JD, Dreolini L, Krzywinski M, Strauss J, Barnes R, Watson P, Allen-Vercoe E, Moore RA, Holt RA. 2012. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res 22:299–306.
    4. Poore GD, Kopylova E, Zhu Q, Carpenter C, Fraraccio S, Wandro S, Kosciolek T, Janssen S, Metcalf J, Song SJ, Kanbar J, Miller-Montgomery S, Heaton R, McKay R, Patel SP, Swafford AD, Knight R. 2020. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579:567–574.
    5. Breitwieser FP, Pertea M, Zimin AV, Salzberg SL. 2019. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res 29:954–960.
