Genome Biol. 2008;9 Suppl 2(Suppl 2):S4.
doi: 10.1186/gb-2008-9-s2-s4. Epub 2008 Sep 1.

Overview of the protein-protein interaction annotation extraction task of BioCreative II

Martin Krallinger et al. Genome Biol. 2008.

Abstract

Background: The biomedical literature is the primary information source for manual protein-protein interaction annotations. Text-mining systems have been implemented to extract binary protein interactions from articles, but a comprehensive comparison between the different techniques as well as with manual curation was missing.

Results: We designed a community challenge, the BioCreative II protein-protein interaction (PPI) task, based on the main steps of a manual protein interaction annotation workflow. It was structured into four distinct subtasks: (a) detection of protein interaction-relevant articles; (b) extraction and normalization of protein interaction pairs; (c) retrieval of the interaction detection methods used; and (d) retrieval of the actual text passages that provide evidence for protein interactions. A total of 26 teams submitted runs for at least one of the proposed subtasks. In the interaction article detection subtask, the top scoring team reached an F-score of 0.78. For interaction pair extraction and mapping to SwissProt, a precision of 0.37 (with a recall of 0.33) was obtained. For associating articles with an experimental interaction detection method, an F-score of 0.65 was achieved. As for the retrieval of the PPI passages best summarizing a given protein interaction in full-text articles, 19% of the submissions returned by one of the runs corresponded to curator-selected sentences. Curators extracted only the passages that best summarized a given interaction, which implies that many of the automatically extracted passages could contain interaction information without being the most informative sentences.
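The scores above combine precision and recall into an F-score. As a minimal sketch, assuming the balanced form (beta = 1, the harmonic mean of precision and recall), the following shows how the reported precision and recall for interaction pair extraction would translate into a single F-score. The function name `f_score` and the `beta` parameter are illustrative, and the resulting value is derived arithmetic, not a figure quoted from the abstract.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported for interaction pair extraction and mapping.
# Assumption: both values refer to the same run, so combining them is meaningful.
ips_f1 = f_score(0.37, 0.33)
print(round(ips_f1, 2))  # 0.35
```

Because the harmonic mean is pulled toward the lower of the two values, a system cannot reach a high F-score by maximizing only precision or only recall.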

Conclusion: The BioCreative II PPI task is the first attempt to compare the performance of text-mining tools specific to each of the basic steps of the PPI extraction pipeline. The challenges identified range from problems in full-text format conversion of articles to difficulties in detecting interactor protein pairs and then linking them to their database records. Some limitations were also encountered when using a single (and possibly incomplete) reference database for protein normalization, or when limiting the search for interactor proteins to co-occurrence within a single sentence, when a mention might span neighboring sentences. Finally, distinguishing between novel, experimentally verified interactions (annotation relevant) and previously known interactions adds complexity to these tasks.
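The single-sentence co-occurrence limitation can be illustrated with a small sketch. It is not any participating team's method. It assumes protein mentions have already been detected and normalized per sentence, and the identifiers `P1` to `P4`, the function name `cooccurring_pairs`, and the `window` parameter are all hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(sentences, window=0):
    """Return unordered protein pairs co-mentioned within a sentence window.

    sentences: list of sets of protein identifiers, one set per sentence.
    window:    number of following sentences merged with each sentence
               (0 means strict single-sentence co-occurrence).
    """
    pairs = set()
    for i in range(len(sentences)):
        # Merge the mentions of sentence i with those of the next `window` sentences.
        span = set().union(*sentences[i:i + window + 1])
        for a, b in combinations(sorted(span), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical mentions: P1 and P2 interact but are named in adjacent sentences.
mentions = [{"P1"}, {"P2"}, {"P3", "P4"}]
single = cooccurring_pairs(mentions, window=0)
wide = cooccurring_pairs(mentions, window=1)
print(("P1", "P2") in single, ("P1", "P2") in wide)  # False True
```

Widening the window recovers the pair split across sentences, but it also produces more candidate pairs (four instead of one in this toy input), which is the usual recall versus precision trade-off of co-occurrence approaches.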

Figures

Figure 1
Manual versus automated protein-protein interaction annotation. Presented is a comparison between the manual protein-protein interaction (PPI) annotation process and the automatic extraction of protein interactions in the context of the PPI task of BioCreative II.
Figure 2
Precision versus recall plot for the IAS. (a) Overview plot for all of the received submissions and (b) zoomed view of the top scoring teams, with some additional details related to the methods used (SVM-based approaches are represented by circles, and other methods by triangles) as well as the AUC score. Runs with an AUC greater than 0.8 are shown in green. AUC, area under the receiver operating characteristic curve; IAS, interaction article subtask; SVM, support vector machine.
Figure 3
Submission agreement versus average article rank. The relation of submission agreement among different runs to the average rank of the articles is presented for both relevant and nonrelevant articles. The overall agreement between systems was lower for nonrelevant articles (R² = 0.7 for relevant versus R² = 0.59 for nonrelevant articles). In general, the higher the average rank of the article, the more systems agreed on the correct classification.
Figure 4
Agreement of TP interactor protein normalization versus total number of protein occurrences, by corresponding organism source. This figure shows the total number of interactor proteins in the test set for each organism with respect to the percentage agreement between different participating systems in case of the correct (true positive [TP]) predictions. Each circle represents the interactor protein set for a single species in the test set. Human (red circle), mouse (pink circle), rat (orange circle), and yeast (green circle) proteins are the most frequent interactor protein organism sources in the interaction pair subtask (IPS) test set collection. A total of 50 different organisms were included in the test set (considering the SwissProt subset), corresponding to 1,110 unique interactor proteins.
Figure 5
TP interaction pair normalization with respect to organism source composition. This figure shows the total number of interaction pairs in the test set (840) for each corresponding organism source combination with respect to the percentage agreement between different participating systems in the case of correct (true positive [TP]) predictions. Interaction pairs of proteins from different organisms (for example, between mouse and human proteins) are basically experimental in vitro interactions.

