Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature
- PMID: 25485170
- PMCID: PMC4256960
- DOI: 10.1145/1557019.1557031
Abstract
Figures are a major source of information, and often the most crucial and informative part, of scholarly articles in scientific journals, proceedings and books: they directly provide images and other graphical illustrations of key experimental results and other scientific content. In biological articles, a typical figure comprises multiple panels, accompanied by either scoped or global caption text. Moreover, the caption text contains important semantic entities, such as protein names, gene ontology, and tissue labels, that are relevant to the images in the figure. Owing to the avalanche of biological literature in recent years and the increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life sciences. We present a new structured probabilistic topic model, built on a realistic figure generation scheme, to model structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system, which has recently entered the final round (4 out of 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Sciences. Here we present various evaluations on a number of data mining tasks to illustrate our method.
Keywords: Algorithms; Experimentation.
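The abstract names collapsed Gibbs sampling as the basis of the inference algorithm but does not spell out the sampler. As a rough illustration of the general technique only, the sketch below runs collapsed Gibbs sampling for plain latent Dirichlet allocation on a toy corpus. It is not the structured correspondence model described here: the function name `collapsed_gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy data are all assumptions made for the example.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampler for plain LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids.
    Returns topic assignments plus the document-topic and topic-word counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assign = []

    # Random initial topic for every token, recorded in the count tables.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Resample the topic and put the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assign, doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-2 and 3-4 tend to co-occur.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 4], [0, 1, 3, 4]]
assign, doc_topic, topic_word = collapsed_gibbs_lda(
    docs, num_topics=2, vocab_size=5)
```

The topic proportions and topic-word distributions are integrated out, so the sampler only tracks count tables and resamples one token's topic at a time. A structured model over panels and captions would need additional variables and counts that this sketch does not attempt to reproduce.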