KDD. 2009:39-48.
doi: 10.1145/1557019.1557031.

Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature

Amr Ahmed et al. KDD. 2009.

Abstract

A major source of information in scholarly articles from scientific journals, proceedings, and books (often the most crucial and informative part) is the figures, which directly provide images and other graphical illustrations of key experimental results and other scientific content. In biological articles, a typical figure often comprises multiple panels accompanied by either scoped or global caption text. Moreover, the caption text contains important semantic entities relevant to the images in the figure, such as protein names, gene ontology terms, and tissue labels. Due to the avalanche of biological literature in recent years and the increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life sciences. We present a new structured probabilistic topic model, built on a realistic figure-generation scheme, to model structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system, which has recently entered the final round (4 out of 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Sciences. Here we present evaluations on a number of data mining tasks to illustrate our method.
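The inference approach named in the abstract, collapsed Gibbs sampling for a correspondence-style topic model, can be sketched on a toy corpus. The sketch below is illustrative only and is not the authors' struct-cLDA implementation: it runs a standard collapsed Gibbs sampler over caption words and then ties each annotation (e.g., a protein entity) to a topic already assigned to that figure's caption words, which is the correspondence constraint of corr-LDA-style models. All corpus contents, entity names, and hyperparameter values are made-up assumptions.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy captioned-figure corpus: (caption words, entity annotations).
# All words and entity names below are illustrative, not from the paper.
docs = [
    (["cell", "nucleus", "stain", "image"], ["ProtA"]),
    (["gene", "expression", "pathway", "cell"], ["ProtB"]),
    (["nucleus", "stain", "microscopy", "image"], ["ProtA"]),
    (["pathway", "gene", "regulation", "expression"], ["ProtB"]),
]

K, ALPHA, BETA = 2, 0.5, 0.1            # topics and Dirichlet hyperparameters
vocab = sorted({w for ws, _ in docs for w in ws})
V = len(vocab)

# Count tables maintained by the collapsed sampler.
ndk = [[0] * K for _ in docs]               # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # topic totals
z = []                                      # topic assignment per word token
for d, (ws, _) in enumerate(docs):
    zs = []
    for w in ws:
        t = random.randrange(K)
        zs.append(t)
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    z.append(zs)

def sample_topic(d, w):
    # Collapsed Gibbs: p(z=k | rest) ∝ (ndk + α) · (nkw + β) / (nk + Vβ)
    weights = [(ndk[d][k] + ALPHA) * (nkw[k][w] + BETA) / (nk[k] + V * BETA)
               for k in range(K)]
    r = random.random() * sum(weights)
    for k, wt in enumerate(weights):
        r -= wt
        if r <= 0:
            return k
    return K - 1

for _ in range(200):                        # Gibbs sweeps
    for d, (ws, _) in enumerate(docs):
        for i, w in enumerate(ws):
            k = z[d][i]
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1   # remove token
            k = sample_topic(d, w)                        # resample
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1   # add back

# Correspondence step: each annotation's topic is drawn from the topics
# already assigned to its figure's caption words (the corr-LDA constraint).
ann_topics = {a: z[d][random.randrange(len(ws))]
              for d, (ws, anns) in enumerate(docs) for a in anns}
print(ann_topics)
```

The correspondence step is what distinguishes this family of models from vanilla LDA: annotations cannot invent topics of their own, so a figure's entities are explained by the same latent themes as its caption.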

Keywords: Algorithms; Experimentation.


Figures

Figure 1
Overview of our approach; see Section 2 for more details. (Best viewed in color.)
Figure 2
The cLDA and struct-cLDA Models. Shaded circles represent observed variables and their colors denote modality (blue for words, red for protein entities, and cyan for image features), unshaded circles represent hidden variables, diamonds represent model parameters, and plates represent replications. Some super/subscripts are removed for clarity—see text for explanation.
Figure 3
Three illustrative topics from a 20-topic run of the struct-cLDA model. See Section 5.1 for more details.
Figure 4
Illustrating topic decomposition and structured browsing. A biological figure tagged with its topic decomposition at different granularities: each panel (top right), caption words (second row), and the whole figure (bottom left). In tagging the caption, light grey is used for words removed during pre-processing and dark grey for background words. Some topics are illustrated in the bottom row. (Best viewed in color.)
Figure 5
Understanding the contributions of the model's features: (a) convergence, (b) time per iteration, and (c) perplexity.
Figure 6
Evaluating protein annotation quality based on observed text and image features (lower is better).
Figure 7
Illustrating figure retrieval performance. Each column depicts the result for a given query, written at its top, with the number of true positives in parentheses (the test set contains 131 figures). The figure compares struct-cLDA against LSI; the horizontal lines show the average precision for each model. (Best viewed in color.)
Figure 8
Illustrating the utility of using partial figures as a function of their ratio in the training set. The task is protein annotation based on (a) the figure's image and text and (b) the image content of the figure only.
