BMC Bioinformatics. 2010 Aug 3;11:410.

doi: 10.1186/1471-2105-11-410.

A method for automatically extracting infectious disease-related primers and probes from the literature

Miguel García-Remesal¹, Alejandro Cuevas, Victoria López-Alonso, Guillermo López-Campos, Guillermo de la Calle, Diana de la Iglesia, David Pérez-Rey, José Crespo, Fernando Martín-Sánchez, Víctor Maojo

PMID: 20682041
PMCID: PMC2923139
DOI: 10.1186/1471-2105-11-410

Miguel García-Remesal et al. BMC Bioinformatics. 2010.

Affiliation

¹ Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain. mgarcia@infomed.dia.fi.upm.es

Abstract

Background: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to be able to navigate this information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information.

Results: We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name.

Conclusions: We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch.
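
Phase (2) of the method described above, detecting candidate sequences with finite state machine-based recognizers, can be illustrated with a minimal sketch. This is not the authors' recognizer set: the two-state design, the alphabet `NUCLEOTIDES`, the tolerated `SEPARATORS`, the `MIN_LENGTH` threshold and the function name `find_candidate_sequences` are all assumptions chosen for illustration.

```python
from typing import List

# All three constants are illustrative assumptions, not values from the paper.
NUCLEOTIDES = set("ACGTU")   # assumed sequence alphabet (upper case only)
SEPARATORS = set(" -")       # assumed separators tolerated inside a sequence
MIN_LENGTH = 15              # assumed minimum length of a primer/probe


def find_candidate_sequences(text: str, min_length: int = MIN_LENGTH) -> List[str]:
    """Scan text with a two-state machine and return candidate sequences."""
    OUTSIDE, INSIDE = 0, 1
    state = OUTSIDE
    buffer: List[str] = []
    candidates: List[str] = []

    def flush() -> None:
        # Keep the buffered run only if it is long enough to be a candidate.
        if len(buffer) >= min_length:
            candidates.append("".join(buffer))
        buffer.clear()

    prev = ""
    for ch in text + "\n":  # the trailing newline forces a final flush
        if state == OUTSIDE:
            if ch in NUCLEOTIDES:
                state = INSIDE
                buffer.append(ch)
        else:
            if ch in NUCLEOTIDES:
                buffer.append(ch)
            elif ch in SEPARATORS and prev not in SEPARATORS:
                pass  # a single separator does not end the sequence
            else:
                flush()
                state = OUTSIDE
        prev = ch
    return candidates


sample = ("Forward primer 5'-ATGGCTAGCTAGGCTA ACGT-3' and probe "
          "TTGACCGGTAACGTTAGC were used.")
print(find_candidate_sequences(sample))
```

A real recognizer would need further states for labels such as 5' and 3', for line breaks inside tables, and for degenerate bases. The abstract assigns the cleanup of problem sequences to a separate rule-based phase.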

Figures

**Figure 1**
**Overview of the primer and probe extraction process**.

**Figure 2**
**ST corresponding to a PDF paper from the Virology Journal identified by PubMed ID 18234069**. Documents are organized hierarchically. The root node <***root***, *null*, *null*> represents the entire document. Complex sections (e.g. those containing multiple subsections) are hierarchically decomposed according to the original paper structure. For instance, the section <***abstract***, *"Abstract"*, *null*> can be decomposed into its three child sections: <***subAbstract***, *"Background"*, *"Dengue is..."*>, <***subAbstract***, *"Results"*, *"An optimal..."*> and <***subAbstract***, *"Conclusion"*, *"These findings..."*>. Nodes of types ***table*** (e.g. <***table***, *"Table 2: Comparison of..."*, *"M-RT-PCR\tVirus isolation\nPositive\t96 (15.48%)\t..."*>) and ***figure*** (e.g. <***figure***, *"Figure 1"*, *"1.5% Agarose gel electrophoresis..."*>) are treated as special sections and are therefore attached as children of the root node. The escape sequences "\t" and "\n" denote tab and newline characters. The natural reading order of the PDF paper can be reproduced by iterating the ST in depth-first order.
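
The section tree described in this caption can be sketched as a small data structure, with each node holding a <type, title, content> triple and a list of children. The class name `SectionNode`, its field names and the function `depth_first` are hypothetical. The sample nodes reuse only the values quoted in the caption, so this is an illustration of the idea rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SectionNode:
    """One <type, title, content> node of the section tree (names assumed)."""
    type: str
    title: Optional[str] = None
    content: Optional[str] = None
    children: List["SectionNode"] = field(default_factory=list)


def depth_first(node: SectionNode) -> Iterator[SectionNode]:
    """Yield the node, then each subtree in order (pre-order traversal)."""
    yield node
    for child in node.children:
        yield from depth_first(child)


# The root carries no title or content; figures hang directly off the root.
root = SectionNode("root", children=[
    SectionNode("abstract", "Abstract", children=[
        SectionNode("subAbstract", "Background", "Dengue is..."),
        SectionNode("subAbstract", "Results", "An optimal..."),
        SectionNode("subAbstract", "Conclusion", "These findings..."),
    ]),
    SectionNode("figure", "Figure 1", "1.5% Agarose gel electrophoresis..."),
])

titles = [n.title for n in depth_first(root) if n.title is not None]
print(titles)
```

Iterating with `depth_first` visits each section before its subsections, which is how the caption says the reading order of the paper is recovered.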

**Figure 3**
**State transition diagrams describing the preliminary sequence recognizers**. Circles represent regular states, whereas double circles stand for final (accepting) states. Edges denote state transitions triggered by the occurrence of any of the symbols drawn on the edges. These include 's' symbols in blue that represent strings of any length belonging to Σ⁺, whereas 's1', 's2' and 's3' are strings of symbols from Σ⁺ of lengths 1, 2 and 3 respectively. Green items represent different literals such as dashes, colons, newline tokens, etc. States labeled with the number 0 that are pointed at by an arrow with no origin represent initial states.

**Figure 4**
**Plot showing how CSs are assigned to the matched organism names depending on the length of the match**. Unlike regular English noun groups, where the meaning of the noun is narrowed by the preceding words, organism names are made more specific by post-positive words. The plot shows the CSs assigned to matches of length l for different values of L. This figure shows that the more specific—i.e. the longer—the matches are, the higher the assigned CS.

Cited by

MiPRIME: an integrated and intelligent platform for mining primer and probe sequences of microbial species.
Zhang Z, Ren J, Ren L, Zhang L, Ai Q, Long H, Ren Y, Yang K, Feng H, Li S, Li X. Bioinformatics. 2024 Jul 1;40(7):btae429. doi: 10.1093/bioinformatics/btae429. PMID: 38954836.
Using nanoinformatics methods for automatically identifying relevant nanotoxicology entities from the literature.
García-Remesal M, García-Ruiz A, Pérez-Rey D, de la Iglesia D, Maojo V. Biomed Res Int. 2013;2013:410294. doi: 10.1155/2013/410294. Epub 2012 Dec 27. PMID: 23509721.
e-MIR2: a public online inventory of medical informatics resources.
de la Calle G, García-Remesal M, Nkumu-Mbomio N, Kulikowski C, Maojo V. BMC Med Inform Decis Mak. 2012 Aug 2;12:82. doi: 10.1186/1472-6947-12-82. PMID: 22857741.
Annotating genes and genomes with DNA sequences extracted from biomedical articles.
Haeussler M, Gerner M, Bergman CM. Bioinformatics. 2011 Apr 1;27(7):980-6. doi: 10.1093/bioinformatics/btr043. Epub 2011 Feb 16. PMID: 21325301.
