BMC Bioinformatics. 2010 Aug 3;11:410.
doi: 10.1186/1471-2105-11-410.

A method for automatically extracting infectious disease-related primers and probes from the literature

Miguel García-Remesal et al. BMC Bioinformatics. 2010.

Abstract

Background: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to diagnosing infectious diseases and prescribing treatment for them. The biological literature is the main source of empirically validated primer and probe sequences, so it is becoming increasingly important for researchers to be able to navigate this information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problematic sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information.

Results: We evaluated our approach on a test set of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The evaluation shows that our approach is suitable for automatically extracting DNA sequences, achieving a precision of 97.98% and a recall of 95.77%. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name.
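The abstract reports precision and recall separately. As a small worked example, the two figures can be combined into an F1 score (their harmonic mean). The F1 value is not reported in the abstract; it is derived here from the two published percentages, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for sequence extraction, as given in the abstract (in %).
f1 = f1_score(97.98, 95.77)
print(f"{f1:.2f}")  # prints 96.86
```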

Conclusions: We believe that the proposed method can facilitate routine tasks for biomedical researchers who use molecular methods to diagnose infectious diseases and prescribe treatment for them. In addition, the method can be extended to detect and extract other biological sequences from the literature. The extracted information can also be used to update existing primer/probe databases or to create new ones from scratch.

Figures

Figure 1
Overview of the primer and probe extraction process.
Figure 2
ST (the tree of paper sections) corresponding to a PDF paper from the Virology Journal identified by PubMed ID 18234069. Documents are organized hierarchically. The root node <root, null, null> represents the entire document. Complex sections (e.g. those containing multiple subsections) are hierarchically decomposed according to the original paper structure. For instance, the section <abstract, "Abstract", null> can be decomposed into its three child sections: <subAbstract, "Background", "Dengue is...">, <subAbstract, "Results", "An optimal..."> and <subAbstract, "Conclusion", "These findings...">. Nodes of types table (e.g. <table, "Table 2: Comparison of...", "M-RT-PCR\tVirus isolation\nPositive\t96 (15.48%)\t...">) and figure (e.g. <figure, "Figure 1", "1.5% Agarose gel electrophoresis...">) are considered special sections and are thus allocated as children of the root node. The escape sequences "\t" and "\n" denote tab and newline characters. The natural reading order of the PDF paper can be reproduced by traversing the ST in depth-first order.
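The structure described in this caption can be sketched as a small tree of <type, title, content> nodes. This is a minimal illustration, not the authors' implementation: the class name SectionNode, its field names, and the choice of a pre-order walk as the depth-first order are assumptions. The example nodes are the ones quoted in the caption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SectionNode:
    """One <type, title, content> node of the tree (names are illustrative)."""
    kind: str
    title: Optional[str] = None
    content: Optional[str] = None
    children: List["SectionNode"] = field(default_factory=list)


def depth_first(node: SectionNode) -> Iterator[SectionNode]:
    """Yield the node itself, then each child subtree in order (pre-order)."""
    yield node
    for child in node.children:
        yield from depth_first(child)


# Tree built from the nodes quoted in the caption; figures hang off the root.
root = SectionNode("root", children=[
    SectionNode("abstract", "Abstract", children=[
        SectionNode("subAbstract", "Background", "Dengue is..."),
        SectionNode("subAbstract", "Results", "An optimal..."),
        SectionNode("subAbstract", "Conclusion", "These findings..."),
    ]),
    SectionNode("figure", "Figure 1", "1.5% Agarose gel electrophoresis..."),
])

reading_order = [n.title for n in depth_first(root) if n.title is not None]
print(reading_order)
```

Walking the tree this way visits the abstract and its three subsections before the figure node, which matches the reading order the caption describes.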
Figure 3
State transition diagrams describing the preliminary sequence recognizers. Circles represent regular states, whereas double circles stand for final (accepting) states. Edges denote state transitions triggered by the occurrence of any of the symbols drawn on the edges. These include 's' symbols in blue that represent strings of any length belonging to ∑+, whereas 's1', 's2' and 's3' are strings of symbols from ∑+ of lengths 1, 2 and 3 respectively. Green items represent different literals such as dashes, colons, newline tokens, etc. States labeled with the number 0 that are pointed at by an arrow with no origin represent initial states.
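A recognizer of the kind this caption describes can be sketched as a table-driven finite state machine. The sketch below is hypothetical and does not reproduce any diagram from the paper: it assumes the alphabet is the set of IUPAC nucleotide codes and accepts only strings of the form 5'-BASES-3', with state 0 as the initial state and state 7 as the single accepting state.

```python
from typing import Dict, Tuple

# Assumed alphabet: IUPAC nucleotide codes (not taken from the paper).
BASES = set("ACGTURYKMSWBDHVN")


def symbol_class(ch: str) -> str:
    """Map a character to the edge label used in the transition table."""
    return "base" if ch.upper() in BASES else ch


# (state, edge label) -> next state.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "5"): 1, (1, "'"): 2, (2, "-"): 3,   # literal prefix 5'-
    (3, "base"): 4, (4, "base"): 4,          # one or more nucleotide symbols
    (4, "-"): 5, (5, "3"): 6, (6, "'"): 7,   # literal suffix -3'
}
ACCEPTING = {7}


def accepts(text: str) -> bool:
    """Run the machine over text; reject on any missing transition."""
    state = 0
    for ch in text:
        key = (state, symbol_class(ch))
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING


print(accepts("5'-ACGTTGCA-3'"))  # True
print(accepts("5'-ACGX-3'"))      # False: X is not in the assumed alphabet
```

Keeping the transitions in a dictionary mirrors the diagram view: each entry is one labeled edge, so adding a new literal or symbol class means adding entries rather than changing control flow.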
Figure 4
Plot showing how CSs are assigned to the matched organism names depending on the length of the match. Unlike regular English noun groups, where the meaning of the noun is narrowed by the preceding words, organism names are made more specific by post-positive words. The plot shows the CSs assigned to matches of length l for different values of L. The more specific (i.e. the longer) the matches are, the higher the assigned CS.
