BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S11.

doi: 10.1186/1471-2105-9-S3-S11.

Identification of transcription factor contexts in literature using machine learning approaches

Hui Yang¹, Goran Nenadic, John A Keane

Affiliation

¹ School of Computer Science, University of Manchester, Manchester, UK. Hui.Yang@manchester.ac.uk

PMID: 18426546
PMCID: PMC2352869
DOI: 10.1186/1471-2105-9-S3-S11

Abstract

Background: The availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in regulating gene expression. Manual literature curation is expensive and labour-intensive, while the development of semi-automated text-mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and the Gene Ontology, GO) or potentially noisy example data (e.g. protein-protein interaction, PPI, data) could be used to provide training data for identifying TF contexts in literature.

Results: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. The learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We exploited background knowledge from existing biological resources (MeSH and GO) to engineer these features. Weak and noisy training datasets were collected from descriptions of TF-related concepts in MeSH and GO, from PPI data, and from data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.

Conclusions: The experimental results show that the proposed model can identify TF-related contexts (i.e. sentences) with high accuracy, using a significantly reduced set of features compared with the traditional bag-of-words approach. The results obtained with existing PPI data suggest that TF and PPI contexts are less similar than we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.
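
The vote-based merging and the F-measure mentioned in the Results can be illustrated with a minimal Python sketch. The abstract does not say which voting scheme, tie-breaking rule, or label names were used, so the simple majority vote, the "first classifier wins" tie-break, the labels `"TF"` and `"other"`, and the toy predictions below are all assumptions made for illustration only.

```python
from collections import Counter

def majority_vote(predictions):
    """Merge per-classifier label lists by simple majority.

    Assumption: ties are resolved in favour of the earliest classifier
    whose label reaches the top count (the abstract gives no rule).
    """
    merged = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        merged.append(next(label for label in labels if counts[label] == top))
    return merged

def f_measure(gold, predicted, positive="TF"):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence-level labels from three classifiers (toy data).
gold  = ["TF", "TF", "other", "other", "TF"]
clf_a = ["TF", "other", "other", "TF", "TF"]
clf_b = ["TF", "TF", "other", "other", "other"]
clf_c = ["other", "TF", "TF", "other", "TF"]

merged = majority_vote([clf_a, clf_b, clf_c])
print(f_measure(gold, clf_a), f_measure(gold, merged))
```

In this toy case each classifier errs on different sentences, so the merged labels score higher than any single classifier; that is the general motivation for voting, not a result from the paper.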

Figures

**Figure 1**
Overall architecture of the approach.

**Figure 2**
The average KL divergence of feature distributions between (1) the TF and PPI datasets and (2) the TF and NonPF datasets, for the GM and BM models, when the top-ranked features are considered (TF&PPI_GM = feature distribution in TF vs. feature distribution in PPI in the GM model, etc.)

**Figure 3**
The F-measures of the three machine learning approaches on the TF&NonPF dataset (GM = generic model; BM = biological model)

**Figure 4**
The F-measures of the three machine learning approaches on the TF&PPI dataset (GM = generic model; BM = biological model)
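
The KL divergence named in the Figure 2 caption compares two feature distributions. The following is a minimal sketch of that computation, assuming features are given as raw counts and that a small additive constant (`epsilon`, a hypothetical parameter) is used to avoid division by zero. The captions do not describe how the paper smooths, averages, or ranks features, so none of that is reproduced here, and the feature counts are invented.

```python
import math

def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    """KL(P || Q) over the union of features, with additive smoothing.

    Assumption: counts are normalised to probabilities after adding
    epsilon to every feature, so unseen features do not give log(0).
    """
    features = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + epsilon * len(features)
    q_total = sum(q_counts.values()) + epsilon * len(features)
    kl = 0.0
    for feature in features:
        p = (p_counts.get(feature, 0) + epsilon) / p_total
        q = (q_counts.get(feature, 0) + epsilon) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical feature counts for two datasets (toy data).
tf_features  = {"bind": 8, "promoter": 6, "activate": 4, "interact": 2}
ppi_features = {"bind": 7, "interact": 9, "complex": 4}

print(kl_divergence(tf_features, ppi_features))
```

A value near zero means the two datasets use the features in nearly the same proportions; larger values mean the distributions differ more. KL divergence is asymmetric, so swapping the arguments generally gives a different number.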

Grants and funding

BB/C007360/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom
