Comparative Study
BMC Bioinformatics. 2007 Jul 10;8:243.
doi: 10.1186/1471-2105-8-243.

Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

Nikolai Daraselia et al. BMC Bioinformatics. 2007.

Abstract

Background: Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work and, often, sophisticated data mining and processing tools. A protein's functions, often referred to as its annotations, are believed to manifest themselves through the topology of networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.

Results: We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates and compared the performance of the automatic extraction technology to that of the manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly more densely linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease in the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.
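The precision and recall evaluation mentioned above can be illustrated with a minimal sketch. It assumes that both the automatically extracted annotation and the manually curated reference are available as sets of (protein, GO term) pairs; the function name and the toy pairs are hypothetical and are not taken from the paper.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of extracted pairs against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    # Precision: fraction of extracted pairs confirmed by the reference.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of reference pairs recovered by the extraction.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: (protein, GO term) pairs, for illustration only.
extracted = {
    ("ATM", "GO:0006302"),
    ("CHEK2", "GO:0006302"),
    ("TREX1", "GO:0016233"),
}
curated = {
    ("ATM", "GO:0006302"),
    ("CHEK2", "GO:0006302"),
    ("PRKDC", "GO:0006302"),
    ("XRCC5", "GO:0016233"),
}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Treating the curated set as ground truth is itself an assumption: a pair missing from the curated annotation may still be correct, so precision measured this way is likely a lower bound.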

Conclusion: Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.

Figures

Figure 1
Distribution of the number of protein-GO associations for three GO annotations. Horizontal axis, GO degree – number of GO associations; vertical axis, probability of GO degree – fraction of proteins with a given GO degree. Red line – public GOA; green – MedScan GOA; black – combined GOA.
Figure 2
Comparison of MedScan performance with other methods for automatic protein annotation reported in Blaschke et al. [28] for BioCreAtIvE task 2.2. The figure is copied from Figure 4 of Blaschke et al. [28], with MedScan performance added as one more method. Each point represents a single run submitted by the participants of task 2.2. User 1: Chiang et al. [45], 2: Couto et al. [46], 3: Ehrler et al. [47], 4: Ray et al. [48], 5: Rice et al. [49], 6: Verspoor et al. [50]. MedScan performance was estimated by comparison with the protein-GO annotation extracted by human curators from the European Bioinformatics Institute (EBI).
Figure 3
A scatter plot of the number of links in a randomized versus the real binding network for GO groups in the public cellular component GOA. All GO groups below the diagonal line have fewer links in the randomized network than in the real one. The error bars correspond to a p-value of 10⁻⁶ for a normal distribution; that is, if the top of an error bar lies below the diagonal line, the probability that the corresponding GO group has this number of links by pure chance is equal to or less than 10⁻⁶. It appears that only a few small GO groups are not linked densely enough to satisfy the 10⁻⁶ threshold.
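The comparison behind Figure 3 can be sketched as follows. This is only an illustration under stated assumptions: the randomization used in the paper is not described on this page, so the sketch substitutes a simple null model that draws random protein sets of the same size as the GO group. All function names and the toy network are hypothetical.

```python
import math
import random
import statistics


def links_within(group, edges):
    """Count edges whose two endpoints both belong to the group."""
    members = set(group)
    return sum(1 for a, b in edges if a in members and b in members)


def group_link_zscore(group, edges, nodes, n_random=1000, seed=0):
    """Compare links inside a group with same-size random protein sets.

    Assumed null model: random node sets of equal size (not necessarily
    the randomization used in the paper).
    """
    rng = random.Random(seed)
    real = links_within(group, edges)
    random_counts = [
        links_within(rng.sample(nodes, len(group)), edges)
        for _ in range(n_random)
    ]
    mean = statistics.fmean(random_counts)
    std = statistics.pstdev(random_counts)
    if std == 0:
        z = 0.0 if real == mean else math.inf
    else:
        z = (real - mean) / std
    return real, mean, z


def normal_upper_tail(z):
    """One-sided tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical toy network: a 5-protein clique attached to a 25-link chain.
nodes = list(range(30))
edges = [(a, b) for a in range(5) for b in range(a + 1, 5)]
edges += [(i, i + 1) for i in range(4, 29)]

real, mean, z = group_link_zscore([0, 1, 2, 3, 4], edges, nodes)
print(real, round(mean, 2), round(z, 1), normal_upper_tail(z))
```

Under a normal approximation, a one-sided p-value of 10⁻⁶ corresponds to roughly 4.75 standard deviations, which is the margin the error bars in the caption would represent.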
Figure 4
Dependence of protein degree in ResNet 4.0 (i.e., the number of regulatory interactions with other proteins in the database) on the GO degree (i.e., the number of GO annotations for a protein). Red pluses – public biological process GO annotation; black circles – combined public and MedScan GOA. The plot shows that MedScan GOA adds highly connected proteins to GO groups from the public annotation.
Figure 5
An overlap between a network cluster obtained by the Potts model algorithm [31] and the best-matching GO groups from the public cellular component GOA. The cluster contains 11 proteins: 10 subunits of RNA polymerase II and a Vpr protein from Human immunodeficiency virus 1. RNA polymerase II is a well-characterized and stable multi-subunit complex that is formed through the physical interactions of its subunits. RNA polymerase II is involved in mRNA synthesis for all eukaryotic protein-coding genes. The Vpr protein from HIV has diverse functions: it regulates the expression of many cellular genes during HIV infection and accelerates the production of viral proteins. A – The portion of the GO classification overlapping with the network cluster. The figure shows the part of the GO classification hierarchy whose bottom node is the GO group that has the statistically best overlap with the Potts cluster. GO groups are depicted as rectangles, and the parent-child relation in the GO tree is shown as a line with an arrow. Only those parent GO groups that have a statistically significant overlap with the Potts cluster are shown. The numbers above the line show the number of proteins in common with the Potts cluster (before the slash) and the total number of proteins in the GO group. The Δc value below the arrow is the number of standard deviations by which the overlap is bigger than the overlap expected by random chance. B – The network cluster overlapping with the GO classification from panel A. Highlighted proteins belong to the best overlapping GO group from the cellular component classification, DNA-directed RNA polymerase II, core complex (GO:0005665).
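A value like Δc, the number of standard deviations by which an observed overlap exceeds the overlap expected by chance, can be sketched as follows. The null model used in the paper is not stated on this page; the sketch assumes that cluster members are drawn without replacement from a fixed protein population (a hypergeometric null), and the function name and the numbers in the example are hypothetical.

```python
import math


def overlap_zscore(overlap, cluster_size, group_size, population_size):
    """Standard deviations by which an overlap exceeds chance expectation.

    Assumes a hypergeometric null model: cluster_size proteins are drawn
    without replacement from population_size proteins, of which group_size
    belong to the GO group.
    """
    p = group_size / population_size
    expected = cluster_size * p
    variance = (
        cluster_size * p * (1 - p)
        * (population_size - cluster_size) / (population_size - 1)
    )
    if variance <= 0:
        raise ValueError("degenerate null model: variance is zero")
    return (overlap - expected) / math.sqrt(variance)


# Hypothetical numbers: an 11-protein cluster sharing 10 proteins with a
# 12-protein GO group in a population of 10,000 annotated proteins.
delta_c = overlap_zscore(overlap=10, cluster_size=11,
                         group_size=12, population_size=10_000)
print(round(delta_c, 1))
```

Because the expected overlap between a small cluster and a small GO group is far below one protein, even a modest shared membership yields a very large z-score, which is why the group size and population size matter as much as the raw overlap.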
Figure 6
An overlap between a network cluster obtained by the Potts algorithm [31] and the best-matching GO groups from the molecular function GOA. The cluster contains eight proteins: five heterodimerizing proteins from the ionotropic glutamate receptor family, syndecan binding protein SDCBP, gamma subunit 2 of voltage-dependent calcium channel (CACNG2), and protein kinase C alpha binding protein (PRKCABP). The molecular function GOA shows the smallest correlation with network clustering among all GOAs (see Results section for details). Nevertheless, the correlation is still significant and provides additional confirmation of the observation that paralogous proteins tend to interact with each other more often than non-paralogous proteins [39]. The picture shows an example of paralog heterodimerization that forms a cluster in the physical interaction network. A – The portion of the GO classification overlapping with the network cluster. The GO classification tree depiction is the same as in Figure 5A. B – The network cluster overlapping with the GO classification from panel A. Highlighted proteins belong to the best overlapping GO group from the molecular function classification – alpha-amino-3-hydroxy-5-methyl-4-isoxazole propionate selective glutamate receptor activity (GO:0004971). The proteins selected by the blue line belong to the second best overlapping GO group from the molecular function classification – potassium channel activity (GO:0005267). Gray links indicate DirectRegulation relations; violet links indicate Binding relations.
Figure 7
An overlap between a network cluster obtained by the Potts algorithm [31] and the best-matching GO groups from the biological process GOA combined from MedScan annotation and public annotation. The cluster contains nine proteins involved in DNA repair and telomere capping: ATM – ataxia telangiectasia mutated homolog (human) (mapped); PRKDC – catalytic polypeptide of DNA-activated protein kinase; NBS1 – nibrin; CHEK2 – protein kinase Chk2; XRCC5 – X-ray repair complementing defective repair in Chinese hamster cells 5; H2AFX – dolichyl-phosphate (UDP-N-acetylglucosamine) N-acetylglucosaminephosphotransferase 1 (GlcNAc-1-P transferase); G22P1 – thyroid autoantigen; NFBD1 – mediator of DNA damage checkpoint 1; TREX1 – three prime repair exonuclease 1. The ataxia-telangiectasia mutated (ATM) kinase signals the presence of DNA double-strand breaks in mammalian cells by phosphorylating proteins that initiate cell-cycle arrest, apoptosis, and DNA repair. The Mre11-Rad50-Nbs1 (MRN) complex acts as a double-strand break sensor for ATM and recruits ATM to broken DNA molecules [42]. Activated ATM phosphorylates its downstream cellular targets H2AFX and Chk2 as well as proteins directly involved in DNA repair: XRCC5, TREX1 and NFBD1. G22P1 and PRKDC are subunits of DNA-activated protein kinase, which can be induced by DNA damage to promote DNA end joining [43]. It can also attenuate CHK2 control of the damage checkpoint [44]. A – The portion of the GO classification overlapping with the network cluster. The GO classification tree depiction is the same as in Figure 5A. B – The network cluster overlapping with the GO classification from panel A. Highlighted proteins belong to the best overlapping GO group from the combined biological process classification – telomere capping (GO:0016233). The proteins selected by the blue line belong to the second best overlapping GO group from the combined biological process classification – double-strand break repair (GO:0006302). Gray links indicate DirectRegulation relations, violet links indicate Binding relations, and green arrows represent ProtModification relations.
