Proc Natl Acad Sci U S A. 2009 Jul 7;106(27):11079-84.
doi: 10.1073/pnas.0905029106. Epub 2009 Jun 18.

Nature of the protein universe

Michael Levitt. Proc Natl Acad Sci U S A.

Abstract

The protein universe is the set of all proteins of all organisms. Here, all currently known sequences are analyzed in terms of families that have single-domain or multidomain architectures and whether they have a known three-dimensional structure. Growth of new single-domain families is very slow: Almost all growth comes from new multidomain architectures that are combinations of domains characterized by approximately 15,000 sequence profiles. Single-domain families are mostly shared by the major groups of organisms, whereas multidomain architectures are specific and account for species diversity. There are known structures for a quarter of the single-domain families, and >70% of all sequences can be partially modeled thanks to their membership in these families.

Conflict of interest statement

The author declares no conflict of interest.

Figures

Fig. 1.
As the NR database grows, the number of different multidomain architecture (MDA) families found by CDART is increasing rapidly with year (Left) or added sequence (Right). In contrast, the number of single-domain architecture (SDA) families is increasing much more slowly. Because the number of sequences is growing exponentially, fractional sequence coverage (number of sequences in a SDA or MDA family divided by the total number of NR sequences) has dropped slightly from 0.88 to 0.76; more than three-quarters of current sequences still contain a domain recognized by a known sequence profile. Merged CDART sequence profiles are used here. Corresponding results with unmerged CDART sequence profiles are given in Fig. S1. The solid curves marked “2008” were made with a release of CDART from February 9, 2008, which contained fewer sequence profiles (24,083 compared with 27,036). This gave rise to smaller numbers of SDA and MDA families and lower coverage. During this time, the number of sequences in the NR database increased by 2 million.
Fig. 2.
Unique and repetitious structural coverage as a function of year and size of the sequence database. Coverage is the percentage of single-domain architecture (SDA) families containing at least one sequence of known three-dimensional structure (in the PDB). For unique coverage, we count each family once, whereas for repetitious coverage we count every sequence in the family. If all of the known structures belonging to a particular family are determined by structural genomics, then that family is counted in structural genomics coverage. (If any structure of a family is not from structural genomics, then the entire family is not.) (Left) Unique coverage with merged CDART sequence profiles has increased from 17% in 1980 to 26% now, with a 5% increase since 2004 due to structural genomics. (Right) This increase in coverage occurred during a period when the number of sequences increased 900-fold (from 8,600 to 7.6 million). The upper curves show corresponding data for repetitious coverage, which is higher at 71%; this is expected because larger families are more likely to contain a member with a known structure. It is an indication of the maximum number of sequences (4.2 million) that could be modeled by homology. (Center) Coverage with unmerged sequence profiles is significantly lower (22% and 54% for unique and repetitious coverage, respectively); this is expected because families are smaller with unmerged sequence profiles and less likely to contain a member with a known structure.
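The distinction between unique and repetitious coverage, and the all-or-nothing rule for structural genomics, can be illustrated with a minimal sketch. The `Family` class, the `"SG"`/`"other"` labels, and the toy family sizes are hypothetical and chosen only for illustration; they are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Family:
    name: str                 # hypothetical family label
    n_sequences: int          # number of sequences in the family
    structure_sources: list   # origin of each known structure: "SG" or "other"


def coverage(families):
    """Return (unique, repetitious, structural-genomics unique) coverage as fractions."""
    with_structure = [f for f in families if f.structure_sources]
    # Unique coverage: each family counts once.
    unique = len(with_structure) / len(families)
    # Repetitious coverage: every sequence in a covered family counts.
    total_sequences = sum(f.n_sequences for f in families)
    repetitious = sum(f.n_sequences for f in with_structure) / total_sequences
    # A family counts toward structural genomics only if all its structures are "SG".
    sg_only = [f for f in with_structure
               if all(source == "SG" for source in f.structure_sources)]
    sg_unique = len(sg_only) / len(families)
    return unique, repetitious, sg_unique


# Toy data (assumed): one large and one small family have structures.
toy = [
    Family("A", 1000, ["other", "SG"]),  # mixed origin, so not counted as SG
    Family("B", 50, ["SG"]),
    Family("C", 5, []),
    Family("D", 2, []),
]
unique, repetitious, sg_unique = coverage(toy)
```

With these toy numbers, half of the families are covered, but almost all sequences are, because the covered families are the large ones. This is the same effect the caption gives for repetitious coverage exceeding unique coverage.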
Fig. 3.
Scaled Venn diagrams of the numbers of single-domain architecture (SDA) and multidomain architecture (MDA) families for the three major organism groups of life: prokaryotes, eukaryotes, and viruses. (Upper Left) For SDA families, there is a good deal of commonality, with 64% of SDAs shared between two or more groups. (Upper Right) For MDA families, the situation is very different, with 96% of MDAs unique to a particular group. The larger eukaryote disk in Upper Right compared with Upper Left shows that although prokaryotes have the highest fraction of SDA families (88%), eukaryotes have the highest fraction of MDA families (68%). The very small number of shared MDAs in Upper Right (4%) shows the relationship that MDAs have to evolutionary diversity. Results with merged sequence profiles are very similar in that Lower Left and Lower Right have corresponding percentages of 61%, 94%, 85%, 68%, and 6%, respectively. The MDA panels are drawn on a different scale from the SDA panels; the area of the prokaryote disk is kept fixed to facilitate comparison.
Fig. 4.
Although the fraction of MDA families with a particular number of members has a power-law dependence on the family size (as shown by the linear log–log plots), the fraction of SDA families with a particular number of members does not. For MDA families, the fraction of families with m members varies as m^(-2.09). For small SDA families, the fraction drops much more slowly than that for large SDA families (varies as m^(-0.18) for m < 32 and then as m^(-2.57) for m > 64).
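A power law appears as a straight line on a log–log plot, and the slope of that line is the exponent. The sketch below recovers an exponent by ordinary least squares on the logarithms. It is an illustration only: the function name `loglog_slope` is hypothetical, the input is synthetic data generated from the MDA exponent quoted in the caption, and the source does not state which fitting method was used.

```python
import math


def loglog_slope(sizes, fractions):
    """Least-squares slope of log(fraction) against log(family size)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(f) for f in fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, noise-free data (assumed): fraction proportional to m^(-2.09).
sizes = [1, 2, 4, 8, 16, 32, 64, 128]
fractions = [m ** -2.09 for m in sizes]
slope = loglog_slope(sizes, fractions)
```

For a distribution like the SDA one, which follows two regimes, the same function could be applied separately to the points with m < 32 and to those with m > 64.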
Fig. 5.
Illustrations of sequence space in which area is proportional to the number of sequences or sequence families in that region. Sequences not characterized by any merged CDART sequence profile are the dark matter of the protein universe (23% of 7,500,000, the gray core). (Top Left) The unique sequence universe contains all sequence families. Eighty-six percent of the families are MDAs, and the other 14% are SDAs. Thirty-two percent of SDA sequence families have a known structure, with one-fifth of these from structural genomics. For 49% of the MDAs, all domains have a known structure (hatched), and another 42% have at least one domain with a known structure (part PDB). (Top Right) The repetitious sequence universe contains all sequences. Most characterized sequences (88%, orange area) have single-domain architectures (SDAs), where one region of the sequence is matched by a sequence profile (colored bar on black line). The remainder (12%, blue area) have multidomain architectures (MDAs), with more than one region of the sequence matched (several colored bars on sequence). Over three-quarters (76%) of the SDA sequences are matched by a sequence profile family that has a known three-dimensional structure, and 4% of the SDA sequences were solved by structural genomics (brown area, hatching indicates domain of known structure). (Middle) Numbers of sequences in the corresponding regions of Top Right. (Bottom) Numbers of families in the corresponding regions of Top Left.
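The percentages quoted for the repetitious sequence universe can be turned into rough sequence counts. The sketch below applies the rounded percentages from the caption to the stated total of 7,500,000 sequences. The resulting counts are approximations derived here, not figures from the paper, and the counts shown in the Middle panel remain authoritative. The helper `percent_of` is hypothetical.

```python
TOTAL_SEQUENCES = 7_500_000  # total stated in the caption


def percent_of(count, percent):
    """Integer count corresponding to a whole-number percentage of `count`."""
    return count * percent // 100


# Dark matter: sequences not matched by any merged CDART profile (23%).
dark_matter = percent_of(TOTAL_SEQUENCES, 23)
characterized = TOTAL_SEQUENCES - dark_matter

# Among characterized sequences, 88% are SDAs and the remaining 12% are MDAs.
sda_sequences = percent_of(characterized, 88)
mda_sequences = characterized - sda_sequences

# 76% of SDA sequences fall in a profile family with a known structure.
sda_with_structure = percent_of(sda_sequences, 76)

print(dark_matter, characterized, sda_sequences, mda_sequences, sda_with_structure)
```

Because every input percentage is rounded, each derived count carries an uncertainty of tens of thousands of sequences.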
