BMC Bioinformatics. 2013 Jan 16;14:12. doi: 10.1186/1471-2105-14-12.

A detailed error analysis of 13 kernel methods for protein-protein interaction extraction


Domonkos Tikk et al. BMC Bioinformatics. 2013.

Abstract

Background: Kernel-based classification is the current state of the art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward; they diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared with each other in terms of their overall performance on different gold-standard corpora, but little is known about their respective performance at the instance level.

Results: We report on a detailed analysis of the shared characteristics and the differences among 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles of dissimilar kernels leads to significant performance gains. However, our analysis also reveals that the characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance.
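For illustration, the sketch below combines per-kernel predictions by simple majority voting. This is only one possible ensembling scheme, and the kernel names and predictions are hypothetical; the abstract does not specify the ensembling method the authors used.

# Minimal sketch of combining dissimilar kernel classifiers by majority vote.
# Assumption: each kernel yields a binary label per candidate pair
# (1 = interacting, 0 = not interacting); names and data are made up.

def majority_vote(predictions):
    """Combine per-kernel binary predictions by simple majority."""
    kernels = list(predictions)
    n_pairs = len(predictions[kernels[0]])
    ensemble = []
    for i in range(n_pairs):
        votes = sum(predictions[k][i] for k in kernels)
        ensemble.append(1 if votes > len(kernels) / 2 else 0)
    return ensemble

preds = {
    "kernel_A": [1, 0, 1],
    "kernel_B": [1, 1, 0],
    "kernel_C": [0, 1, 1],
}
print(majority_vote(preds))  # -> [1, 1, 1]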

Conclusions: Our experiments show that current methods do not capture the shared characteristics of positive PPI pairs very well, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.


Figures

Figure 1
The distribution of pairs according to classification success level in the cross-validation setting. The distribution of pairs (total, positive, and negative) in terms of the number of kernels that classify them correctly (success level), aggregated across the five corpora in the cross-validation setting. Detailed data for each corpus can be found in Table 1. All 13 kernels are taken into consideration.
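To make the "success level" statistic concrete, the sketch below counts, for each pair, how many kernels classify it correctly and tabulates the resulting distribution; the gold labels and predictions here are hypothetical.

# Sketch of computing classification success levels (number of kernels that
# classify a pair correctly), as plotted in Figure 1. Data are hypothetical.

from collections import Counter

def success_levels(gold, per_kernel_preds):
    """gold: true label per pair; per_kernel_preds: one prediction list per kernel."""
    levels = []
    for i, truth in enumerate(gold):
        correct = sum(1 for preds in per_kernel_preds if preds[i] == truth)
        levels.append(correct)
    return Counter(levels)  # maps success level -> number of pairs

gold = [1, 0, 1, 0]
preds = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(success_levels(gold, preds))  # e.g. Counter({2: 2, 1: 1, 3: 1})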
Figure 2
The distribution of pairs according to classification success level in the cross-learning setting. The distribution of pairs (total, positive, and negative) in terms of the number of kernels that classify them correctly (success level), aggregated across the five corpora in the cross-learning setting. Detailed data for each corpus can be found in Table 2. All kernels except the very slow PT kernel are taken into consideration.
Figure 3
Heatmap of success level correlation in CV and CL evaluations. Correlation ranges from 2 (cyan) through 63 (white) to 1266 (magenta) pairs. Hues are on a logarithmic scale.
Figure 4
Characteristics of pairs by difficulty class. Characteristics of pairs by difficulty class (average sentence length in words, average word distance between entities, and average shortest-path length in the dependency graph (DG) and syntax tree (ST)). ND – negative difficult, NN – negative neutral, NE – negative easy, PD – positive difficult, PN – positive neutral, PE – positive easy.
Figure 5
The number of positive and negative pairs vs. the length of the sentence containing the pair.
Figure 6
The positive ground truth rate vs. the length of the sentence containing the pair.
Figure 7
Class distribution of pairs depending on the number of proteins in the sentence.
Figure 8
Similarity of kernels as a dendrogram and heat map. Colors below the dendrogram indicate the parsing information used by a kernel. Similarity of kernel outputs ranges from full agreement (red) to 33% disagreement (yellow) on the five benchmark corpora. Clustering is performed with R’s hclust (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html).
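As a rough analogue of the clustering behind Figure 8, the sketch below clusters kernels by their pairwise disagreement rate. The paper uses R’s hclust; this sketch uses SciPy’s equivalent, and the prediction matrix is hypothetical.

# Sketch: hierarchical clustering of kernels by output disagreement, in the
# spirit of Figure 8. The paper uses R's hclust; this uses SciPy instead.
# The prediction matrix below is hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# rows = kernels, columns = candidate pairs, entries = binary predictions
preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1],
])

n = preds.shape[0]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # distance = fraction of candidate pairs on which two kernels disagree
        dist[i, j] = np.mean(preds[i] != preds[j])

tree = linkage(squareform(dist), method="average")
print(tree)  # linkage matrix; pass to scipy.cluster.hierarchy.dendrogram to plot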
Figure 9
Comparison of some non-kernel-based and kernel-based classifiers in terms of F-score (CV evaluation). The first nine are non-kernel-based classifiers; the last four are kernel-based classifiers.


