BMC Bioinformatics. 2007 Aug 7:8:293.
doi: 10.1186/1471-2105-8-293.

Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

Carlos Rodríguez-Penagos et al. BMC Bioinformatics.

Abstract

Background: Manual curation of biological databases, an expensive and labor-intensive process, is essential for high-quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using text-mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12.
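To make the idea of rule-based extraction concrete, here is a minimal sketch of how a single hand-written rule could turn a sentence into regulatory network edges. It is not the system described in the paper: the regular expression, the verb list, the `EFFECT` mapping, and the function name `extract_interactions` are all assumptions made for illustration, and a real system would rely on much richer linguistic analysis.

```python
import re

# Hypothetical rule: "<Regulator> activates|represses|regulates [the] <target>".
# Regulators are assumed to start with a capital letter (e.g. CRP, LexA);
# targets are assumed to follow the usual gene style (e.g. lacZ, recA).
PATTERN = re.compile(
    r"\b(?P<regulator>[A-Z][A-Za-z0-9]{1,5})\s+"
    r"(?P<verb>activates|represses|regulates)\s+"
    r"(?:the\s+)?(?P<target>[a-z]{3}[A-Z0-9]*)\b"
)

# Map each trigger verb to an edge sign; "?" marks an unspecified effect.
EFFECT = {"activates": "+", "represses": "-", "regulates": "?"}


def extract_interactions(sentence):
    """Return (regulator, effect, target) triples matched by the toy rule."""
    return [
        (m["regulator"], EFFECT[m["verb"]], m["target"])
        for m in PATTERN.finditer(sentence)
    ]


if __name__ == "__main__":
    text = "CRP activates lacZ and LexA represses the recA gene."
    for edge in extract_interactions(text):
        print(edge)
```

Each triple can be read as a signed edge in a regulatory network. Keeping the source sentence alongside each edge would let a curator check the extraction against the text.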

Results: Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners.
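The evaluation described above compares an automatically extracted network against a curated one. A minimal sketch of that comparison follows, assuming each interaction is reduced to a (regulator, target) pair. The function names and the toy edges are hypothetical; they are not taken from RegulonDB or from the paper's results.

```python
def coverage(curated, extracted):
    """Fraction of curated interactions that also appear in the extracted set."""
    if not curated:
        return 0.0
    return len(curated & extracted) / len(curated)


def novel_candidates(curated, extracted):
    """Extracted interactions absent from the curated set, for manual review."""
    return extracted - curated


if __name__ == "__main__":
    # Toy data for illustration only.
    curated = {("CRP", "lacZ"), ("LexA", "recA"), ("AraC", "araB"), ("Fis", "rrnB")}
    extracted = {("CRP", "lacZ"), ("LexA", "recA"), ("FNR", "narG")}

    print(f"coverage: {coverage(curated, extracted):.0%}")
    print("to review:", novel_candidates(curated, extracted))
```

The set difference corresponds to the interactions not yet in the curated database. These may be new findings or extraction errors, which is why they would go to a curator rather than directly into the database.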

Conclusion: Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.

Figures

Figure 1
Annotation workflow. A suggested workflow for parallel manual and automatic annotations of transcriptional regulation, with manual review of automatically-generated networks in shaded lines. Curators would check the interactions mined from text, since they would be provided with the reference papers and the textual segments from which the system retrieved them.
Figure 2
Corpus coverage of transcriptional regulation in E. coli. The Venn diagram illustrates overlapping coverage in the corpora used in this work, with dots representing papers relevant for transcriptional regulation in E. coli K-12. Different selection strategies (keyword searches on PubMed and references from curated databases) result in diverse document sets, which in some cases share groups of documents and can also contain non-relevant papers.
Figure 3
Curated and retrieved articles for RegulonDB, by year. Comparison between all references initially retrieved from PubMed using the RegulonDB curators' search algorithms and the references that were finally reviewed in full to populate the database. Since the search algorithms are refined and changed continuously, this is shown only for illustration.
