Finding scientific topics

Thomas L Griffiths et al. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5228-35. doi: 10.1073/pnas.0307752101. Epub 2004 Feb 10.

Abstract

A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
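The generative model and inference procedure the abstract describes can be sketched in code. Below is a minimal collapsed Gibbs sampler for that topic model; it is an illustration, not the authors' implementation, and all function names, hyperparameter values, and defaults are ours:

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for the topic model in the abstract.

    docs: list of documents, each a list of word ids in [0, W)
    T: number of topics. Returns (phi, theta): estimates of the
    topic-word distributions and document-topic distributions.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    nwt = np.zeros((W, T))   # word-topic counts
    ndt = np.zeros((D, T))   # document-topic counts
    nt = np.zeros(T)         # tokens per topic
    z = []                   # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove this token, then resample from the full conditional
                nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    phi = (nwt + beta) / (nt + W * beta)
    theta = (ndt + alpha) / (ndt.sum(1, keepdims=True) + T * alpha)
    return phi, theta
```

After burn-in, `phi` and `theta` are the posterior point estimates of the topic-word and document-topic distributions, recovered from a single sample of the assignments `z`.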


Figures

Fig. 1.
(a) Graphical representation of 10 topics, combined to produce “documents” like those shown in b, where each image is the result of 100 samples from a unique mixture of these topics. (c) Performance of three algorithms on this dataset: variational Bayes (VB), expectation propagation (EP), and Gibbs sampling. Lower perplexity indicates better performance, with chance being a perplexity of 25. Estimates of the standard errors are smaller than the plot symbols, which mark 1, 5, 10, 20, 50, 100, 150, 200, 300, and 500 iterations.
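The perplexity compared in Fig. 1c is a standard transform of held-out log-likelihood; lower is better, and a uniform (chance) model over a vocabulary of 25 words scores exactly 25. A minimal sketch (the function name is ours):

```python
import numpy as np

def perplexity(total_log_prob, n_tokens):
    """Perplexity of a test set: exp(-log P(w_test) / N).

    total_log_prob: total (natural) log-probability of all test tokens.
    n_tokens: number of test tokens N. A uniform model over a
    W-word vocabulary gives perplexity exactly W.
    """
    return float(np.exp(-total_log_prob / n_tokens))
```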
Fig. 2.
Results of running the Gibbs sampling algorithm. The log-likelihood, shown on the left, stabilizes after a few hundred iterations. Traces of the log-likelihood are shown for all four runs, illustrating the consistency in values across runs. Each row of images on the right shows the estimates of the topics after a certain number of iterations within a single run, matching the points indicated on the left. These points correspond to 1, 2, 5, 10, 20, 50, 100, 150, 200, 300, and 500 iterations. The topics expressed in the data gradually emerge as the Markov chain approaches the posterior distribution.
Fig. 3.
Model selection results, showing the log-likelihood of the data for different settings of the number of topics, T. The estimated standard errors for each point were smaller than the plot symbols.
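Bayesian model selection over T requires the marginal likelihood P(w | T), which the paper estimates from the Gibbs samples via a harmonic mean of the per-sample likelihoods P(w | z). A minimal, numerically stable sketch of that estimator (the function name is ours):

```python
import numpy as np

def harmonic_mean_log_evidence(log_liks):
    """Harmonic-mean estimate of log P(w | T) from per-sample values
    of log P(w | z) collected during Gibbs sampling.

    Computes log[ S / sum_s 1/P_s ] = -log mean_s exp(-l_s),
    using the log-sum-exp trick to avoid overflow.
    """
    x = -np.asarray(log_liks, dtype=float)
    m = x.max()
    return -(np.log(np.mean(np.exp(x - m))) + m)
```

The point of the log-sum-exp shift is that the per-sample log-likelihoods are typically large negative numbers, so exponentiating them directly would underflow.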
Fig. 4.
(Upper) Mean values of θ at each of the diagnostic topics for all 33 PNAS minor categories, computed by using all abstracts published in 2001. Higher probabilities are indicated with darker cells. (Lower) The five most probable words in the topics themselves listed in the same order as on the horizontal axis in Upper.
Fig. 5.
The plots show the dynamics of the three hottest and three coldest topics from 1991 to 2001, defined as those topics that showed the strongest positive and negative linear trends. The 12 most probable words in those topics are shown below the plots.
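Ranking topics by the strength of their linear trend, as in Fig. 5, can be sketched as follows. This assumes the per-year mean θ values have already been computed; the function name and input layout are ours:

```python
import numpy as np

def hot_and_cold_topics(theta_by_year, k=3):
    """Rank topics by the linear trend of their mean prevalence.

    theta_by_year: array of shape (n_years, T) holding the mean theta
    for each topic in each year. Returns the indices of the k hottest
    (strongest positive slope) and k coldest (strongest negative
    slope) topics.
    """
    years = np.arange(theta_by_year.shape[0])
    # least-squares slope of each topic's prevalence series
    slopes = np.polyfit(years, theta_by_year, 1)[0]
    order = np.argsort(slopes)
    return order[-k:][::-1], order[:k]
```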
Fig. 6.
A PNAS abstract (18) tagged according to topic assignment. The superscripts indicate the topics to which individual words were assigned in a single sample, whereas the contrast level reflects the probability of a word being assigned to the most prevalent topic in the abstract, computed across samples.
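A simplified version of the tagging in Fig. 6 takes, for each word position, the modal topic across Gibbs samples together with the fraction of samples agreeing with it. This is a loose reading of the caption, not the authors' exact procedure, and the names are ours:

```python
import numpy as np

def tag_words(z_samples):
    """For each word position, the most frequent topic assignment
    across samples and the fraction of samples agreeing with it
    (a rough stand-in for the contrast level in Fig. 6).

    z_samples: array-like of shape (n_samples, n_words).
    Returns (tags, confidences).
    """
    z = np.asarray(z_samples)
    tags, conf = [], []
    for col in z.T:
        vals, counts = np.unique(col, return_counts=True)
        i = counts.argmax()
        tags.append(int(vals[i]))
        conf.append(counts[i] / len(col))
    return tags, conf
```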

Similar articles

  • Mapping topics and topic bursts in PNAS. Mane KK, Börner K. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5287-90. doi: 10.1073/pnas.0307626100. Epub 2004 Feb 20. PMID: 14978278.
  • The simultaneous evolution of author and paper networks. Börner K, Maru JT, Goldstone RL. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5266-73. doi: 10.1073/pnas.0307625100. Epub 2004 Feb 19. PMID: 14976254.
  • Mixed-membership models of scientific publications. Erosheva E, Fienberg S, Lafferty J. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5220-7. doi: 10.1073/pnas.0307760101. Epub 2004 Mar 12. PMID: 15020766.
  • Metropolis sampling in pedigree analysis. Sobel E, Lange K. Stat Methods Med Res. 1993;2(3):263-82. doi: 10.1177/096228029300200305. PMID: 8261261.
  • Basic Bayesian methods. Glickman ME, van Dyk DA. Methods Mol Biol. 2007;404:319-38. doi: 10.1007/978-1-59745-530-5_16. PMID: 18450057.


References

    1. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022.
    2. Hofmann, T. (2001) Machine Learn. J. 42, 177-196.
    3. Cohn, D. & Hofmann, T. (2001) in Advances in Neural Information Processing Systems 13 (MIT Press, Cambridge, MA), pp. 430-436.
    4. Iyer, R. & Ostendorf, M. (1996) in Proceedings of the International Conference on Spoken Language Processing (Applied Science & Engineering Laboratories, Alfred I. duPont Inst., Wilmington, DE), Vol. 1, pp. 236-239.
    5. Bigi, B., De Mori, R., El-Beze, M. & Spriet, T. (1997) in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (IEEE, Piscataway, NJ), pp. 535-542.
