Database (Oxford). 2022 Aug 12:2022:baac062.
doi: 10.1093/database/baac062.

A roadmap for the functional annotation of protein families: a community perspective

Valérie de Crécy-Lagard et al. Database (Oxford). 2022.

Abstract

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by the more than 500 000 genomes sequenced to date. By different estimates, only half of the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between organismal lineages. Such a large knowledge gap hampers all aspects of the biological enterprise and thereby stands in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue, funded by the National Science Foundation, was held on 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists in the same venue allowed for a comprehensive assessment of the current state of functional annotation of protein families. Major issues obstructing the field were identified and discussed, which led to proposed solutions for moving forward.

Figures

Figure 1.
Different existing resources to separate isofunctional families. The top three panels show different methods based on sequence similarities to identify subgroups. The bottom panel focuses on rule-based approaches. (A) SSN example from RadicalSAM.org, with proteins as nodes linked by an edge if they are similar within a certain threshold, showing the separation of the members of the Radical SAM superfamily; some subgroups cannot be separated, as seen in Megaclusters 1–5; some are distinct, as seen in Clusters 6–10; (B) network representation of the HIGH-signature proteins, UspA, and PP-ATPase (HUP) Superfamily (CATH 3.40.50.620) showing available functional annotations in FunFams. The colored nodes indicate FunFams annotated with different EC numbers, and the gray nodes indicate FunFams without any EC annotation, which include nonenzymes [Figure from (127)]; (C) GO Phylogenetic Annotation: annotations of gains and losses of functions on ancestral nodes in the tree, based on experimental annotations (left), lead to different function annotations of uncharacterized proteins depending on their evolutionary history (right); (D) UniRule generation platform.
Figure 2.
(A) RCSB PDB converts global data into global knowledge. (B) INDRA performs knowledge assembly from the biomedical literature and expert-curated databases into a knowledge base of mechanistic statements that can be converted into models and networks and queried through human–machine dialogue.
Figure 3.
Using phylogenetic relationships to guide the integration of data associated with related proteins, mining of genomic and post-genomic data can seed defined hypotheses for the discovery of molecular and biological functions associated with genes/proteins of unknown or uncertain function.
Figure 4.
Questions discussed during the five sessions.
