Finding scientific topics

Thomas L Griffiths et al. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5228-35. doi: 10.1073/pnas.0307752101. Epub 2004 Feb 10.

Abstract

A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
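The generative model and inference procedure the abstract describes can be sketched in code. Below is a minimal collapsed Gibbs sampler for that topic model; it is an illustration, not the authors' implementation, and all function names, hyperparameter values, and defaults are ours:

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for the topic model in the abstract.

    docs: list of documents, each a list of word ids in [0, W)
    T: number of topics. Returns (phi, theta): estimates of the
    topic-word distributions and document-topic distributions.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    nwt = np.zeros((W, T))   # word-topic counts
    ndt = np.zeros((D, T))   # document-topic counts
    nt = np.zeros(T)         # tokens per topic
    z = []                   # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove this token, then resample from the full conditional
                nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    phi = (nwt + beta) / (nt + W * beta)
    theta = (ndt + alpha) / (ndt.sum(1, keepdims=True) + T * alpha)
    return phi, theta
```

After burn-in, `phi` and `theta` are the posterior point estimates of the topic-word and document-topic distributions, recovered from a single sample of the assignments `z`.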


Figures

Fig. 1.
(a) Graphical representation of 10 topics, combined to produce “documents” like those shown in b, where each image is the result of 100 samples from a unique mixture of these topics. (c) Performance of three algorithms on this dataset: variational Bayes (VB), expectation propagation (EP), and Gibbs sampling. Lower perplexity indicates better performance, with chance being a perplexity of 25. Estimates of the standard errors are smaller than the plot symbols, which mark 1, 5, 10, 20, 50, 100, 150, 200, 300, and 500 iterations.
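The perplexity compared in Fig. 1c is a standard transform of held-out log-likelihood; lower is better, and a uniform (chance) model over a vocabulary of 25 words scores exactly 25. A minimal sketch (the function name is ours):

```python
import numpy as np

def perplexity(total_log_prob, n_tokens):
    """Perplexity of a test set: exp(-log P(w_test) / N).

    total_log_prob: total (natural) log-probability of all test tokens.
    n_tokens: number of test tokens N. A uniform model over a
    W-word vocabulary gives perplexity exactly W.
    """
    return float(np.exp(-total_log_prob / n_tokens))
```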
Fig. 2.
Results of running the Gibbs sampling algorithm. The log-likelihood, shown on the left, stabilizes after a few hundred iterations. Traces of the log-likelihood are shown for all four runs, illustrating the consistency in values across runs. Each row of images on the right shows the estimates of the topics after a certain number of iterations within a single run, matching the points indicated on the left. These points correspond to 1, 2, 5, 10, 20, 50, 100, 150, 200, 300, and 500 iterations. The topics expressed in the data gradually emerge as the Markov chain approaches the posterior distribution.
Fig. 3.
Model selection results, showing the log-likelihood of the data for different settings of the number of topics, T. The estimated standard errors for each point were smaller than the plot symbols.
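Bayesian model selection over T requires the marginal likelihood P(w | T), which the paper estimates from the Gibbs samples via a harmonic mean of the per-sample likelihoods P(w | z). A minimal, numerically stable sketch of that estimator (the function name is ours):

```python
import numpy as np

def harmonic_mean_log_evidence(log_liks):
    """Harmonic-mean estimate of log P(w | T) from per-sample values
    of log P(w | z) collected during Gibbs sampling.

    Computes log[ S / sum_s 1/P_s ] = -log mean_s exp(-l_s),
    using the log-sum-exp trick to avoid overflow.
    """
    x = -np.asarray(log_liks, dtype=float)
    m = x.max()
    return -(np.log(np.mean(np.exp(x - m))) + m)
```

The point of the log-sum-exp shift is that the per-sample log-likelihoods are typically large negative numbers, so exponentiating them directly would underflow.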
Fig. 4.
(Upper) Mean values of θ at each of the diagnostic topics for all 33 PNAS minor categories, computed by using all abstracts published in 2001. Higher probabilities are indicated with darker cells. (Lower) The five most probable words in the topics themselves listed in the same order as on the horizontal axis in Upper.
Fig. 5.
The plots show the dynamics of the three hottest and three coldest topics from 1991 to 2001, defined as those topics that showed the strongest positive and negative linear trends. The 12 most probable words in those topics are shown below the plots.
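Ranking topics by the strength of their linear trend, as in Fig. 5, can be sketched as follows. This assumes the per-year mean θ values have already been computed; the function name and input layout are ours:

```python
import numpy as np

def hot_and_cold_topics(theta_by_year, k=3):
    """Rank topics by the linear trend of their mean prevalence.

    theta_by_year: array of shape (n_years, T) holding the mean theta
    for each topic in each year. Returns the indices of the k hottest
    (strongest positive slope) and k coldest (strongest negative
    slope) topics.
    """
    years = np.arange(theta_by_year.shape[0])
    # least-squares slope of each topic's prevalence series
    slopes = np.polyfit(years, theta_by_year, 1)[0]
    order = np.argsort(slopes)
    return order[-k:][::-1], order[:k]
```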
Fig. 6.
A PNAS abstract (18) tagged according to topic assignment. The superscripts indicate the topics to which individual words were assigned in a single sample, whereas the contrast level reflects the probability of a word being assigned to the most prevalent topic in the abstract, computed across samples.
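A simplified version of the tagging in Fig. 6 takes, for each word position, the modal topic across Gibbs samples together with the fraction of samples agreeing with it. This is a loose reading of the caption, not the authors' exact procedure, and the names are ours:

```python
import numpy as np

def tag_words(z_samples):
    """For each word position, the most frequent topic assignment
    across samples and the fraction of samples agreeing with it
    (a rough stand-in for the contrast level in Fig. 6).

    z_samples: array-like of shape (n_samples, n_words).
    Returns (tags, confidences).
    """
    z = np.asarray(z_samples)
    tags, conf = [], []
    for col in z.T:
        vals, counts = np.unique(col, return_counts=True)
        i = counts.argmax()
        tags.append(int(vals[i]))
        conf.append(counts[i] / len(col))
    return tags, conf
```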

Similar articles

  • Mapping topics and topic bursts in PNAS. Mane KK, Börner K. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5287-90. doi: 10.1073/pnas.0307626100. Epub 2004 Feb 20. PMID: 14978278.
  • The simultaneous evolution of author and paper networks. Börner K, Maru JT, Goldstone RL. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5266-73. doi: 10.1073/pnas.0307625100. Epub 2004 Feb 19. PMID: 14976254.
  • Mixed-membership models of scientific publications. Erosheva E, Fienberg S, Lafferty J. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5220-7. doi: 10.1073/pnas.0307760101. Epub 2004 Mar 12. PMID: 15020766.
  • Metropolis sampling in pedigree analysis. Sobel E, Lange K. Stat Methods Med Res. 1993;2(3):263-82. doi: 10.1177/096228029300200305. PMID: 8261261.
  • Basic Bayesian methods. Glickman ME, van Dyk DA. Methods Mol Biol. 2007;404:319-38. doi: 10.1007/978-1-59745-530-5_16. PMID: 18450057.


References

    1. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022.
    2. Hofmann, T. (2001) Machine Learn. J. 42, 177-196.
    3. Cohn, D. & Hofmann, T. (2001) in Advances in Neural Information Processing Systems 13 (MIT Press, Cambridge, MA), pp. 430-436.
    4. Iyer, R. & Ostendorf, M. (1996) in Proceedings of the International Conference on Spoken Language Processing (Applied Science & Engineering Laboratories, Alfred I. duPont Inst., Wilmington, DE), Vol. 1, pp. 236-239.
    5. Bigi, B., De Mori, R., El-Beze, M. & Spriet, T. (1997) in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (IEEE, Piscataway, NJ), pp. 535-542.
