Review

Biol Direct. 2014 Jun 6:9:10.

Profiling the orphan enzymes

Maria Sorokina¹, Mark Stam, Claudine Médigue, Olivier Lespinet, David Vallenet

Affiliation

¹ Direction des Sciences du Vivant, Commissariat à l'Energie Atomique (CEA), Institut de Génomique, Genoscope, Laboratoire d'Analyses Bioinformatiques pour la Génomique et le Métabolisme, 2 rue Gaston Crémieux, 91057 Evry, France. msorokina@genoscope.cns.fr.

PMID: 24906382
PMCID: PMC4084501
DOI: 10.1186/1745-6150-9-10

Maria Sorokina et al. Biol Direct. 2014.

Abstract

Next Generation Sequencing generates vast amounts of sequence data and offers great potential for new enzyme discovery. Despite this wealth of data and the profusion of bioinformatic methods for function prediction, a large fraction of known enzyme activities still lacks an associated protein sequence. These activities are called "orphan enzymes". This review updates previous surveys of orphan enzymes by mining the current content of public databases. Although the percentage of orphan enzyme activities has decreased from 38% to 22% in ten years, more than 1,000 of the 5,000 entries in the Enzyme Commission (EC) classification are still orphans. When all the reactions present in metabolic databases are taken into account, the proportion of orphans rises to nearly 50%, and many of them are not associated with a known pathway. We extended our survey to "local orphan enzymes": activities that have no representative sequence in a given clade but have at least one in organisms belonging to other clades. We observe a strong bias in Archaea and find that, in general, more than 30% of the EC activities have incomplete sequence information in at least one superkingdom. To estimate whether candidate proteins for local orphans could be retrieved by homology search, we applied a simple strategy based on the PRIAM software and found that candidates can be proposed for a substantial fraction of local orphan enzymes. Finally, a study of the relation between protein domains and catalyzed activities shows that newly discovered enzymes are mostly associated with already known enzyme domains. Thus, exploring the promiscuity and multifunctionality of known enzyme families may solve part of the orphan enzyme issue. We conclude this review by presenting recent initiatives to find proteins for orphan enzymes and to extend the enzyme world through the discovery of new activities.

Figures

**Figure 1**
**Orphan enzyme chronicles.** Studies on orphan enzymatic activities in the past ten years.

**Figure 2**
**EC classification evolution over years. (a)** Snapshot of EC number status by year of creation. This bar plot shows the number of EC numbers created each year, with currently active entries in red and transferred/deleted entries in pink. **(b)** Dynamics of EC entry creations and status changes over years. This bar plot shows the number of EC entry modifications each year: creation (yellow), transfer (light red) and deletion (dark red).

**Figure 3**
**Delayed knowledge in the EC classification.** Heatmap of the number of EC entries reported by the year of the activity discovery (X axis) versus the year of the corresponding EC entry creation (Y axis). The square’s shade of gray is proportional to the number of EC entries. A delay can be observed between the discovery of an activity and the creation of the corresponding EC number.

**Figure 4**
**Proportion of orphan EC activities by their year of discovery.** This bar plot represents the proportion of orphans among all EC activities discovered in a given year. To make the trend easier to follow, the data are smoothed by a non-parametric local regression (blue line).

**Figure 5**
**The dynamics of enzyme discovery.** The solid red line represents the number of enzymatic activities by year of discovery, estimated from the earliest publication linked to the corresponding EC entry in the IntEnz database. If no publication is mentioned, the year of creation of the EC entry is used instead. The dotted green line represents the number of activities associated with a biological sequence for the first time. The year of protein-EC number association is estimated from UniProt’s PubMed cross-references, selecting only articles that cite fewer than ten other proteins in order to exclude publications on the sequencing of large genomic regions. The artefactual peak in 1961 is due to the large number of entries created during the first EC meeting, where many activities were assigned an EC number without any traceable publication.
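The two dating rules in the Figure 5 caption can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function names and the input shapes (plain lists of years, and `(year, n_other_proteins)` pairs) are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def discovery_year(publication_years: Sequence[int], ec_creation_year: int) -> int:
    """Year of discovery of an activity.

    Uses the earliest publication linked to the EC entry; when no
    publication is linked, falls back to the EC entry creation year.
    """
    return min(publication_years) if publication_years else ec_creation_year


def first_sequence_year(
    references: Sequence[Tuple[int, int]], max_other_proteins: int = 10
) -> Optional[int]:
    """Year an activity was first associated with a sequence.

    `references` holds hypothetical (year, n_other_proteins) pairs, one per
    cross-referenced article. Articles citing ten or more other proteins are
    skipped, since they likely describe large-scale sequencing.
    """
    years = [year for year, n_other in references if n_other < max_other_proteins]
    return min(years) if years else None
```

For instance, an entry with publications from 1975 and 1968 is dated 1968, while an entry with no linked publication takes its EC creation year.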

**Figure 6**
**Orphan and non-orphan EC number distribution across superkingdoms.** The green pie chart represents the proportion of orphan EC activities among all valid entries. The other pie charts represent the proportion of orphan activities within each superkingdom. An activity is considered present in a superkingdom if at least one protein is annotated with the corresponding EC number or the activity has been observed in an organism according to the BRENDA database. The number and percentage of local and global orphans are given for each superkingdom. The small number of characterized EC numbers in Archaea highlights the lack of knowledge about their metabolism.
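The presence rule in the Figure 6 caption, combined with the definition of local orphans given in the abstract, can be expressed as a small classification function. The sketch below is illustrative only: the function name, the status labels and the three-superkingdom tuple are assumptions, and the inputs are assumed to be precomputed sets of superkingdom names.

```python
from typing import Dict, Set

# Assumed superkingdom labels for the example.
SUPERKINGDOMS = ("Archaea", "Bacteria", "Eukaryota")


def classify_activity(observed_in: Set[str], sequenced_in: Set[str]) -> Dict[str, str]:
    """Classify one EC activity in each superkingdom.

    observed_in:  superkingdoms where the activity has been observed
                  experimentally (e.g. according to BRENDA).
    sequenced_in: superkingdoms where at least one protein is annotated
                  with the EC number.
    """
    # An activity is present if it is either observed or has a sequence.
    present = observed_in | sequenced_in
    status = {}
    for sk in SUPERKINGDOMS:
        if sk not in present:
            status[sk] = "absent"
        elif sk in sequenced_in:
            status[sk] = "non-orphan"
        elif sequenced_in:
            # No sequence here, but at least one in another superkingdom.
            status[sk] = "local orphan"
        else:
            # No sequence in any superkingdom.
            status[sk] = "global orphan"
    return status
```

An activity observed in Archaea and Bacteria but sequenced only in Bacteria would be a local orphan in Archaea and a non-orphan in Bacteria.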

**Figure 7**
**Proportion of EC activities with new protein domains.** This bar plot represents, by year of discovery, the proportion of EC numbers having at least one new Pfam domain, that is, a domain never previously associated with any enzyme. An EC number is considered to be associated with a new domain if this domain has never been associated with any other EC number discovered previously. Only EC numbers with at least one associated sequence were taken into account.
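The "new domain" rule in the Figure 7 caption amounts to a chronological scan over EC numbers. The following Python sketch shows one way to compute the yearly proportion. It assumes the input already contains only EC numbers with at least one associated sequence, and it treats domains first seen in the same year as not yet known, which is an interpretation rather than something the caption states. All names and the toy identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def new_domain_proportion(
    records: Iterable[Tuple[str, int, Set[str]]]
) -> Dict[int, float]:
    """Proportion of EC numbers per discovery year with a never-seen domain.

    records: (ec_number, discovery_year, domains) triples.
    """
    by_year = defaultdict(list)
    for _ec, year, domains in records:
        by_year[year].append(set(domains))

    seen: Set[str] = set()  # domains of EC numbers from earlier years
    proportions = {}
    for year in sorted(by_year):
        domain_sets = by_year[year]
        # An EC number counts if any of its domains is absent from `seen`.
        n_new = sum(1 for domains in domain_sets if domains - seen)
        proportions[year] = n_new / len(domain_sets)
        # Only after scoring the whole year do its domains become known.
        for domains in domain_sets:
            seen |= domains
    return proportions
```

With toy records where `domA` appears in 1960 and then again in 1970 alongside a previously unseen `domB`, only the EC number carrying `domB` counts as new in 1970.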
