Int J Mol Sci. 2025 Aug 1;26(15):7432.
doi: 10.3390/ijms26157432.

A Natural Language Processing Method Identifies an Association Between Bacterial Communities in the Upper Genital Tract and Ovarian Cancer

Andrew Polio et al. Int J Mol Sci.

Abstract

Bacterial communities within the female upper genital tract may influence the risk of ovarian cancer. In this retrospective cohort pilot study, we aimed to detect differences in bacterial communities between ovarian cancer samples and normal controls using topic modeling, a natural language processing tool. RNA was extracted and analyzed using the VITCOMIC2 pipeline. Topic modeling assessed differences in bacterial communities: ldatuning identified an optimal number of latent topics, and Latent Dirichlet Allocation (LDA) assessed topic differences between high-grade serous ovarian cancer (HGSOC) and controls. Results were validated using The Cancer Genome Atlas (TCGA) HGSOC dataset. A total of 801 unique taxa were identified, with 13 bacteria differing significantly between HGSOC and normal controls. LDA modeling revealed a latent topic associated with HGSOC samples, containing the bacteria Escherichia/Shigella and Corynebacterineae. Pathway analysis using KEGG databases suggests differences in several biologic pathways, including oocyte meiosis, aldosterone-regulated sodium reabsorption, gastric acid secretion, and long-term potentiation. These findings support the hypothesis that bacterial communities in the upper female genital tract may influence the development of HGSOC by altering the local environment, with potential functional implications between HGSOC and normal controls. However, further validation is required to confirm these associations and determine mechanistic relevance.
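
The idea of applying topic modeling to sequencing data can be illustrated with a small sketch: each sample plays the role of a document, each taxon the role of a word, and each latent topic the role of a bacterial community. The Python code below is a minimal collapsed Gibbs sampler for LDA written for illustration only; it is not the R workflow used in the study. The function name fit_lda, the hyperparameters alpha and beta, and the toy counts are all assumptions.

```python
import random


def fit_lda(counts, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a sample-by-taxon count matrix.

    counts[d][w] is the count of taxon w in sample d (hypothetical input).
    Returns (theta, phi): per-sample topic proportions and
    per-topic taxon probabilities.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Expand each row of counts into a list of taxon "tokens".
    docs = [[w for w, c in enumerate(row) for _ in range(c)] for row in counts]

    doc_topic = [[0] * n_topics for _ in range(n_docs)]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(doc_topic[d][k] + alpha) / (len(docs[d]) + n_topics * alpha)
         for k in range(n_topics)]
        for d in range(n_docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
         for w in range(n_words)]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy counts (made up): 4 samples x 4 placeholder taxa.
toy_counts = [
    [30, 25, 1, 0],
    [28, 31, 0, 2],
    [1, 0, 27, 33],
    [0, 2, 35, 29],
]
theta, phi = fit_lda(toy_counts, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of theta gives the estimated topic mixture of one sample, and each row of phi gives the taxon probabilities within one topic. Comparing topic proportions between groups is then a separate statistical step.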

Keywords: RNA sequencing; RNAseq; microbiome; natural language processing; ovarian cancer; prediction model.


Conflict of interest statement

The authors have nothing to disclose. This does not alter our adherence to the journal policies on sharing data and materials.

Figures

Figure 1
Patient population. Normal fallopian tube samples were obtained from patients with no risk factors and no personal or family history of ovarian cancer. Of the 20 samples, 12 were suitable for sequencing and were sequenced. HGSOC samples were from patients who underwent surgical intervention at the University of Iowa Hospitals and Clinics and had their tumors sequenced.
Figure 2
Comparison of 16S RNA gene expression between HGSOC and control samples. (A) Heatmap of the normalized counts of the 30 most frequent bacterial 16S RNAs found by RNAseq in HGSOC and control samples. Abundance counts are represented in green. Analysis was performed with the phyloseq R package (v 4.4.1). (B) Heatmap of the log2-transformed normalized 16S RNA counts found by RNAseq for the taxa that differed between HGSOC and control samples on univariate analysis (N = 13). Analysis was performed with the DESeq2 R package. Log2-transformed 16S RNA expression is represented on a blue–red scale.
Figure 3
Analysis to identify an optimal number of latent topics in the cohort. The FindTopicsNumber function from the ldatuning (v 1.0.3) package was used to identify an optimal number of latent topics using both minimization (CaoJuan2009, Arun2010) and maximization (Griffiths2004, Deveaud2014) metrics. Based on these metrics, 83 was selected as the number of topics for the model. The horizontal axis shows the number of topics tested, from 0 to 120. The vertical axis shows the percentage of variation.
Figure 4
Topic modeling using Latent Dirichlet Allocation (LDA). LDA is a natural language processing tool for topic modeling, used here to assess differentially abundant topics between HGSOC and control samples. Left panel: topic #81 shows a positive log2 fold change (>1), with an over 9-fold change between cancer and control samples and a significant FDR-adjusted p-value (p < 0.05). Right panel: per-topic probabilities (horizontal axis) for each bacterium (vertical axis).
Figure 5
KEGG pathway differences between HGSOC and normal controls. Multiple pathways were significantly different, including environmental information processing, metabolic, and organismal systems pathways. Middle panel: the vertical axis shows the KEGG names of the significant pathways; the lower axis shows the relative gene expression abundance (log2 transformed) in those pathways. Right panel: log2 fold change with direction (negative: lower in cancer than in normal; positive: higher in cancer than in normal). Far right: adjusted p-value of the difference.
Figure 6
Analysis to identify an optimal number of latent topics in the TCGA cohort. The FindTopicsNumber function from the ldatuning (v 1.0.3) package was used to identify an optimal number of latent topics using both minimization (CaoJuan2009, Arun2010) and maximization (Griffiths2004, Deveaud2014) metrics. Based on these metrics, 43 was selected as the optimal number of topics. The horizontal axis shows the number of topics tested, from 0 to 120. The vertical axis shows the percentage of variation.
Figure 7
Topic modeling using Latent Dirichlet Allocation (LDA) in the TCGA dataset. Left panel: topics #19 and #36 show positive log2 fold changes (>1), and topic #16 shows a negative log2 fold change between cancer and control samples. All three have significant FDR-adjusted p-values (p < 0.05). Right panel: probabilities (horizontal axis) for each bacterium (vertical axis) in topic #36.
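
The captions above report log2 fold changes and FDR-adjusted p-values. As a reading aid, the following Python sketch shows how the two quantities relate to raw values: a log2 fold change of 1 corresponds to a 2-fold difference, and a 9-fold difference corresponds to a log2 fold change of about 3.17. The adjustment shown is the Benjamini-Hochberg procedure, a common FDR method; the source does not state which FDR method was used, so that choice is an assumption, as are the function names and the example p-values.

```python
import math


def log2_fold_change(mean_case, mean_control):
    """log2 ratio of two strictly positive group means."""
    return math.log2(mean_case / mean_control)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical values, for illustration only.
print(round(log2_fold_change(9.0, 1.0), 2))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A topic or taxon would typically be flagged when its adjusted p-value falls below the chosen threshold (0.05 in the captions) and its log2 fold change exceeds the chosen magnitude.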


