Review
Biochem J. 2009 Dec 14;425(1):1-11.
doi: 10.1042/BJ20091328.

'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering parts list--and how to find it

Andrew D Hanson et al. Biochem J.

Abstract

Like other forms of engineering, metabolic engineering requires knowledge of the components (the 'parts list') of the target system. Lack of such knowledge impairs both rational engineering design and diagnosis of the reasons for failures; it also poses problems for the related field of metabolic reconstruction, which uses a cell's parts list to recreate its metabolic activities in silico. Despite spectacular progress in genome sequencing, the parts lists for most organisms that we seek to manipulate remain highly incomplete, due to the dual problem of 'unknown' proteins and 'orphan' enzymes. The former are all the proteins deduced from genome sequence that have no known function, and the latter are all the enzymes described in the literature (and often catalogued in the EC database) for which no corresponding gene has been reported. Unknown proteins constitute up to about half of the proteins in prokaryotic genomes, and much more than this in higher plants and animals. Orphan enzymes make up more than a third of the EC database. Attacking the 'missing parts list' problem is accordingly one of the great challenges for post-genomic biology, and a tremendous opportunity to discover new facets of life's machinery. Success will require a co-ordinated community-wide attack, sustained over years. In this attack, comparative genomics is probably the single most effective strategy, for it can reliably predict functions for unknown proteins and genes for orphan enzymes. Furthermore, it is cost-efficient and increasingly straightforward to deploy owing to a proliferation of databases and associated tools.

Figures

Figure 1. Scale and relentless growth of the unknown protein and orphan enzyme problems
(A) The percentages of known and unknown proteins encoded by representative genomes. The numbers of known proteins were estimated from the SEED database by summing protein-encoding genes included in subsystems and those with non-hypothetical functions not in subsystems. Since this assumes that all proteins in subsystems have known functions, and some such functions are merely reasonable hypotheses, this gives a generous estimate of known proteins. (B) A qualitative sketch of the relationship between the number of conserved unknown proteins and the number of genomes sequenced, from an exploratory analysis by R. Overbeek and A. L. Osterman (personal communication). The SEED database was used to estimate the number of protein families (corresponding roughly to orthologues) comprising at least five members from genomes representing two or more genera (thereby excluding very local families). A jackknife approach was then used to compute an average number of families (blue curve) in bundles (‘runs’) of progressively increased size from 20 genomes up to 650 genomes per run. The lower curve in red shows the number of families having at least some elements of function assigned (i.e. at least a general function such as ‘sugar kinase’ deduced from homology). Note again the generosity of this estimate of the number of proteins that have a known function. The yellow area between the curves represents the number of unknown families. (C) Cumulative total numbers of biochemical activities (EC numbers) characterized between 1950 and 2000, and those that are still orphans. Data are derived from Figure 1 of [33].
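The counting procedure described for panel (B) can be illustrated with a minimal sketch. It is not the analysis code behind the figure: the genome identifiers, genus labels, and family memberships below are hypothetical toy data, repeated random subsampling stands in for the jackknife approach, and applying the "at least five members from two or more genera" filter within each run is an assumption.

```python
import random
from statistics import mean

# Hypothetical toy data: genome -> genus, and family -> genome of each member protein.
GENUS_OF = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "C"}
FAMILY_MEMBERS = {
    "famX": ["g1", "g2", "g3", "g4", "g5"],  # 5 members, 3 genera: counted
    "famY": ["g1", "g1", "g1", "g2", "g2"],  # 5 members, 1 genus: too local
    "famZ": ["g1", "g3"],                    # only 2 members: too small
}

def count_families(selected, family_members, genus_of, min_members=5, min_genera=2):
    """Count families with enough members and genera among the selected genomes."""
    selected = set(selected)
    total = 0
    for member_genomes in family_members.values():
        present = [g for g in member_genomes if g in selected]
        genera = {genus_of[g] for g in present}
        if len(present) >= min_members and len(genera) >= min_genera:
            total += 1
    return total

def average_family_count(run_size, family_members, genus_of, n_runs=100, seed=0):
    """Average the family count over random runs of run_size genomes."""
    rng = random.Random(seed)
    all_genomes = sorted(genus_of)
    counts = [
        count_families(rng.sample(all_genomes, run_size), family_members, genus_of)
        for _ in range(n_runs)
    ]
    return mean(counts)

full_count = count_families(GENUS_OF, FAMILY_MEMBERS, GENUS_OF)
avg_small_runs = average_family_count(2, FAMILY_MEMBERS, GENUS_OF)
```

Plotting the average against run size gives a curve of the kind sketched in blue. Restricting the count to families with at least a general function assigned would give the lower curve.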
Figure 2. Types of associations in comparative genomics
Multiple types of evidence gathered from genomic and post-genomic resources are integrated in order to make predictions on gene function. The more lines of evidence converge, the more robust predictions become.
Figure 3. The genome deluge and its implications
Statistics are from the Genomes OnLine Database (http://genomesonline.org/index2.htm). (A) Progress in genome sequencing since 1997. The inset plots the square of the number of completely sequenced genomes against time; this value is roughly proportional to the potential for recognizing functional associations from genome data. Note its explosive growth since about 2006. The incomplete genomes include several hundred comprehensive EST (expressed sequence tag) projects. (B) Taxonomic breakdown of the 1067 genomes completed by August 2009. (C) Taxonomic breakdown of 3446 full genome sequencing projects that were ongoing (incomplete) as of August 2009. EST and genome survey projects are excluded from the total.
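One plausible reading of why the inset in (A) plots the squared genome count is that functional associations are recognized by comparing genomes with one another, and the number of distinct genome pairs grows roughly as the square of the number of genomes. The caption does not state this reasoning, so the sketch below is an interpretation; only the figure of 1067 completed genomes comes from the caption.

```python
from math import comb

def genome_pairs(n_genomes: int) -> int:
    """Number of distinct pairs of genomes, n * (n - 1) / 2."""
    return comb(n_genomes, 2)

completed_aug_2009 = 1067  # completed genomes as of August 2009 (from the caption)
pairs = genome_pairs(completed_aug_2009)
ratio_to_square = pairs / completed_aug_2009 ** 2  # approaches 1/2 as n grows
```

Because the ratio settles near one half, the pair count and the squared genome count differ only by a near-constant factor, which is consistent with the caption's "roughly proportional".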
Figure 4. FolQ, a missing folate synthesis enzyme in bacteria and plants
(A) The tetrahydrofolate biosynthesis pathway and its branches leading to queuosine and pterins. Note that the previously missing enzyme FolQ (YlgG in Lactococcus lactis) is the first step unique to the folate pathway. Abbreviations: BH4, 5,6,7,8-tetrahydrobiopterin; CPH4, 6-carboxy-5,6,7,8-tetrahydropterin; DHF, 7,8-dihydrofolate; DHM, 7,8-dihydromonapterin; DHN, 7,8-dihydroneopterin; DHP, 7,8-dihydropteroate; Glu, glutamate; HMDHP, 6-hydroxymethyl-7,8-dihydropterin; MH4, 5,6,7,8-tetrahydromonapterin; -P, phosphate; -P2, pyrophosphate; -P3, triphosphate; pABA, p-aminobenzoate; P-ase, non-specific phosphatase; PTP, 6-pyruvoyl-5,6,7,8-tetrahydropterin; Que, queuosine; THF, 5,6,7,8-tetrahydrofolate; THF-Glun, tetrahydrofolate polyglutamates. (B) Clustering in operonic arrangements of folQ with genes encoding other folate synthesis enzymes in two lactobacteria (phylum Firmicutes) and Leptotrichia buccalis (phylum Fusobacteria). Arrows indicate transcriptional direction; overlapping arrows indicate translational coupling. Genes are colour-coded in agreement with (A); non-conserved genes are coloured grey. A short intervening block of six genes separates the two clusters of folate synthesis genes in L. buccalis. The rose highlight linking the bacterial folQ genes and the vertical triangle represent the projection of the bacterial gene function to plants.
Figure 5. A tryptophan to quinolinate pathway in bacteria
(A) The two biosynthetic routes to quinolinate: the five-step ‘eukaryotic’ route and the two-step ‘prokaryotic’ one. ACM semialdehyde, 2-amino-3-carboxymuconate semialdehyde. Conversion of ACM semialdehyde into quinolinate (asterisked) is non-enzymatic. (B) Schematic profile of the presence and absence of the seven genes of the ‘eukaryotic’ and ‘prokaryotic’ pathways among two representative eukaryotes, two representative bacteria with the ‘prokaryotic’ pathway (Escherichia coli and Bacillus subtilis), and three bacteria with the ‘eukaryotic’ pathway (Polaribacter filamentus, Gemmata sp., and Xanthomonas axonopodis). +, gene present; -, gene absent. (C) Clustering in operonic arrangements of various ‘eukaryotic’ pathway genes in representative bacteria. Arrows indicate transcriptional direction. Non-conserved genes are coloured grey.
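The presence/absence profile in panel (B) lends itself to a simple computation: genes that share the same pattern across organisms are candidates for belonging to the same pathway. The sketch below is illustrative only. The gene and organism names are hypothetical placeholders, and the patterns are toy values, not the data plotted in the figure.

```python
from collections import defaultdict

# Hypothetical organisms (column order) and toy profiles: "+" present, "-" absent.
ORGANISMS = ["org1", "org2", "org3", "org4"]
PROFILES = {
    "geneA": "++--",
    "geneB": "++--",
    "geneC": "--++",
    "geneD": "++++",
}

def group_by_profile(profiles):
    """Group gene names by identical presence/absence pattern."""
    groups = defaultdict(list)
    for gene, pattern in sorted(profiles.items()):
        groups[pattern].append(gene)
    return dict(groups)

groups = group_by_profile(PROFILES)
```

Here geneA and geneB co-occur in the same organisms and fall into one group, whereas geneD is present everywhere and so carries little information about pathway membership. Real analyses usually tolerate a few mismatches rather than requiring identical patterns.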
Figure 6. Leucine catabolism in bacteria and humans
(A) Enzymatic steps involved in the later steps of leucine catabolism and the metabolism of hydroxymethylglutaryl-CoA (HMG-CoA). Intermediates are shown by chemical names. Enzymes conserved in humans, Bacillus subtilis and certain other bacteria are as follows: IVD, isovaleryl-CoA dehydrogenase (EC 1.3.99.10); MCCC, methylcrotonoyl-CoA carboxylase (EC 6.4.1.4) [2a/2c, biotin-carboxylase subunit/biotin carboxyl carrier domain/subunit; 2b, carboxyl transferase subunit]; MGCH, methylglutaconyl-CoA hydratase (EC 4.2.1.18); HMGCL, hydroxymethylglutaryl-CoA lyase (EC 4.1.3.4); and AACS, acetoacetate-CoA synthetase (EC 6.2.1.16). These five enzymes are colour-coded. Two enzymes related to the mevalonate pathway of isoprenoid/sterol biosynthesis (present in humans, but not B. subtilis) are shown in grey: HMGCS, HMG-CoA synthase (EC 2.3.3.10); and HMGCR, HMG-CoA reductase (EC 1.1.1.34). Enzymes catalysing early steps of leucine catabolism (from leucine to isovaleryl-CoA) are similar in humans and bacteria (not shown). (B) Projection of functional assignments between human and bacterial genes. Gene names (human and B. subtilis) corresponding to pathway enzymes are colour-coded and numbered in agreement with (A). The reasoning used in analysing leucine catabolism in bacteria is illustrated by arrowheads pointing in the direction of functional projections. Vertical triangles with red lettering correspond to unambiguous projections based on orthology (same specific function). The triangles point in the direction of the projection. Vertical triangles with black lettering indicate homologues that belong to large families that contain multiple paralogues that share a ‘general class’ function, but differ in substrate specificity. Horizontal triangles indicate conjectures based on gene clustering (refinement of ‘general class’ functions and genuine functional predictions). 
(C) Large operon-like clusters of genes related to leucine catabolism detected in a number of Gram-positive and Gram-negative bacteria. Conserved homologous genes are colour-coded and numbered in agreement with (A) and (B). Genes without homologues in a given chromosomal neighbourhood are coloured grey.
Figure 7. A hypothetical shortcut to the plant choline-oxidizing enzyme
(A) The choline to glycine betaine pathway in bacteria and plants, and related reactions of choline metabolism in bacteria. BetA, choline dehydrogenase (EC 1.1.99.1); BetB, betaine aldehyde dehydrogenase (EC 1.2.1.8); BetC, choline sulfatase (EC 3.1.6.6); Bmt, betaine–homocysteine S-methyltransferase (EC 2.1.1.5); CMO, choline mono-oxygenase (EC 1.14.15.7), a Rieske-type [2Fe–2S] protein; DMGO, dimethylglycine oxidase (EC 1.5.3.10); Fd, ferredoxin; FNR, ferredoxin–NADP+ reductase (EC 1.18.1.2); GlyA, serine hydroxymethyltransferase (EC 2.1.2.1); MSOX, monomeric sarcosine oxidase; MttB, homologue of trimethylamine methyltransferase often clustered with dimethylglycine oxidase; TSOX, heterotetrameric sarcosine oxidase. Other bacterial enzymes (not shown) that mediate oxidation of choline to betaine aldehyde are choline oxidase (EC 1.1.3.17) and GbsB, a soluble, NAD-linked type III alcohol dehydrogenase. (B) Typical clustering arrangements of the choline–glycine betaine pathway genes betA and betB with betI (encoding a transcriptional repressor) and betT (encoding a choline transporter) or betC. Genes are colour-coded in agreement with (A). (C) Clustering in diverse bacteria of genes for Rieske-type proteins homologous with choline mono-oxygenase with up to 13 different genes of choline metabolism. Genes are colour-coded in agreement with (A) and (B). The rose highlight linking the bacterial Rieske-type genes and the vertical triangle represent the projection of the hypothetical bacterial gene function (choline oxidation) to plant choline mono-oxygenases. The gene labelled opuAC encodes a homologue of the periplasmic choline-binding component of an ABC (ATP-binding cassette) transporter. The genes labelled α, β, γ, and δ encode the four subunits of heterotetrameric sarcosine oxidase. Non-conserved genes are coloured grey.

References

    1. Stephanopoulos GN, Aristidou AA, Nielsen J. Metabolic Engineering: Principles and Methodologies. Academic Press; San Diego: 1998.
    2. Hanson AD, Shanks JV. Plant metabolic engineering: entering the S curve. Metab Eng. 2002;4:1–2.
    3. Capell T, Christou P. Progress in plant metabolic engineering. Curr Opin Biotechnol. 2004;15:148–154.
    4. Wu S, Chappell J. Metabolic engineering of natural products in plants; tools of the trade and challenges for the future. Curr Opin Biotechnol. 2008;19:145–152.
    5. Kunze R, Frommer WB, Flügge UI. Metabolic engineering of plants: the role of membrane transport. Metab Eng. 2002;4:57–66.