BMC Bioinformatics. 2016 Jul 25;17 Suppl 7(Suppl 7):246.
doi: 10.1186/s12859-016-1100-z.

Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features

Thi Thanh Thuy Phan et al. BMC Bioinformatics. 2016.

Abstract

Background: Protein-protein interaction (PPI) extraction from published scientific articles is a key issue in biological research because of its importance in understanding biological processes. Despite considerable recent advances in automatic PPI extraction from articles, the performance of existing methods still needs to be improved.

Results: Our feature-based method combines the strengths of diverse features, such as lexical and word-context features derived from sentences, syntactic features derived from parse trees, and features based on existing patterns, to extract PPIs automatically from articles. We assemble related features into four groups and define a contribution level (CL) for each group. Our method consists of two steps. First, we divide the training set into subsets based on the structure of the sentence and the existence of significant keywords (SKs), and apply predefined sentence patterns to each subset. Second, we automatically perform feature selection based on the CL values of the four groups and the k-nearest neighbor algorithm (k-NN) through three approaches: (1) using only the group with the best contribution level (BEST1G); (2) an unoptimized combination of the three groups with the best contribution levels (U3G); and (3) an optimized combination of the two groups with the best contribution levels (O2G).
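The group-based selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how a CL is computed or how the O2G combination is optimized, so the CL values below are made-up toy numbers, and the helper names (`top_groups`, `select_columns`, `knn_predict`) and the column layout in `groups` are hypothetical. The sketch only shows ranking four groups by CL, keeping the columns of the chosen groups, and classifying with a plain k-NN vote.

```python
from collections import Counter
from math import dist

def top_groups(cl, n):
    """Return the names of the n groups with the highest contribution level."""
    return sorted(cl, key=cl.get, reverse=True)[:n]

def select_columns(X, groups, chosen):
    """Keep only the feature columns that belong to the chosen groups."""
    cols = [c for g in chosen for c in groups[g]]
    return [[row[c] for c in cols] for row in X]

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy CL values (assumed for illustration only, not results from the paper).
cl = {"G1": 0.61, "G2": 0.72, "G3": 0.55, "G4": 0.68}
# Hypothetical mapping from each group to its feature columns.
groups = {"G1": [0], "G2": [1], "G3": [2], "G4": [3]}

best1g = top_groups(cl, 1)   # BEST1G-style choice: the single best group
u3g = top_groups(cl, 3)      # U3G-style choice: the three best groups
o2g = top_groups(cl, 2)      # starting point for O2G; the optimization step is not sketched

# Toy training data: 1 = interacting pair, 0 = non-interacting pair.
train_X = [[0, 0, 9, 0], [0, 1, 3, 1], [1, 0, 7, 0], [5, 5, 2, 5], [5, 6, 8, 6], [6, 5, 1, 5]]
train_y = [0, 0, 0, 1, 1, 1]

reduced = select_columns(train_X, groups, o2g)
query = select_columns([[0, 0.2, 4, 0.1]], groups, o2g)[0]
label = knn_predict(reduced, train_y, query, k=3)
```

With these toy values the ranking is G2, G4, G1, G3, so the two-group selection keeps columns 1 and 3 and discards the noisy column 2 before the k-NN vote.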

Conclusions: Our method outperforms other state-of-the-art PPI extraction systems in terms of F-score on the HPRD50 corpus and achieves promising results that are comparable with these systems on other corpora. Further, on every corpus our method obtains a higher F-score than k-NN alone, that is, k-NN used without exploiting the CLs of the groups of related features.

Keywords: Biomedical text mining; Information extraction; Protein protein interaction; k-nearest neighbors.

Figures

Fig. 1
Example of a constituent parse tree. Constituent parse tree for the sentence “Oxytocin stimulates IP3 production in dose-dependent fashion as well” (sentence IEPA.d0.s0 of the IEPA corpus; the first protein, P1, is Oxytocin and the second protein, P2, is IP3)
Fig. 2
Framework of our PPI extraction system. Our system consists of two phases. First, the training set is divided into subsets based on the presence of significant keywords and the feature position of the keyword. Second, after cross-validation is performed on the training data to assess the contribution levels of the four groups of related features, feature selection is performed automatically through our three approaches (BEST1G, U3G, O2G). Finally, the k-NN classifier is used to classify candidate PPI pairs in the test data
Fig. 3
Outline of PPI prediction based on division of the training set. The training set was divided into three subsets, A, B, and C, based on the existence of a significant keyword and the feature position of the keyword. Three classifiers were generated, one from each subset. Similarly, each unlabeled instance was assigned to one of three subsets, A’, B’, or C’, and the corresponding classifier was used to identify whether a PPI exists in that instance
Fig. 4
S-fold cross-validation (SFCV) performed on original training data. Original training data $Train_{all}$ was divided into $S$ equal-sized partitions $P_i$ ($i = 0, \dots, S-1$) to perform SFCV on it to estimate contribution levels of four groups, $G_1$, $G_2$, $G_3$, and $G_4$, and perform feature selection
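The S-fold cross-validation in Fig. 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure from the paper: the abstract does not say how a contribution level is derived from the folds, so the code simply averages the F-score over the S held-out partitions. The function names and the `train_and_predict` callback are hypothetical, and instances are assumed to be `(features, label)` pairs with label 1 for an interacting pair.

```python
def s_fold_partitions(instances, s):
    """Split instances into s partitions P_0..P_{s-1} whose sizes differ by at most one."""
    return [instances[i::s] for i in range(s)]

def f_score(gold, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cross_validated_score(instances, s, train_and_predict):
    """Average F-score over s folds; each partition is held out exactly once.

    train_and_predict(train, held_out_features) is a hypothetical callback
    that returns one predicted label per held-out feature vector.
    """
    partitions = s_fold_partitions(instances, s)
    scores = []
    for i, held_out in enumerate(partitions):
        # Train on every partition except the one currently held out.
        train = [inst for j, part in enumerate(partitions) if j != i for inst in part]
        predicted = train_and_predict(train, [x for x, _ in held_out])
        scores.append(f_score([y for _, y in held_out], predicted))
    return sum(scores) / s

# Toy data and a trivial baseline that labels every candidate pair as interacting.
toy = [((i,), i % 2) for i in range(10)]
always_positive = lambda train, xs: [1 for _ in xs]
baseline = cross_validated_score(toy, 5, always_positive)
```

One way to score a feature group with this scaffold would be to pass a `train_and_predict` callback that uses only that group's features, and to repeat this for each of the four groups; whether that matches the paper's CL definition is an assumption.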
