2006 Dec 12;103(50):19027-32.
doi: 10.1073/pnas.0608796103. Epub 2006 Dec 4.

Almost all human genes resulted from ancient duplication

Roy J Britten. Proc Natl Acad Sci U S A.

Abstract

Protein sequence comparison at an open criterion shows a very large number of relationships that have, until now, gone unreported. These relationships suggest many ancient events of gene duplication. It is well known that gene duplication has been a major process in the evolution of genomes. A collection of human genes of known function was examined for a history of gene duplication, detected through amino acid sequence similarity by using BLASTp with an expectation of two or less (the open criterion). Because the collection of genes in build 35 includes sets of transcript variants, all genes of known function were collected and only the longest transcript variant of each was included, yielding a 13,298-member library called KGMV (for known genes maximum variant). When matches of all lengths are accepted, >97% of human genes show significant matches to each other. Many match a large number of other, different proteins, showing that most genes are made up of parts of many others as a result of ancient duplication events. To support the use of the open criterion, all members of the KGMV library were twice replaced with random protein sequences of the same length and average composition, and all were compared with each other with BLASTp at an expectation of two or less. The number of matches averaged 0.35% of that observed for the KGMV set of proteins.
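The random-sequence control described in the abstract can be illustrated with a minimal sketch. It assumes that "average composition" means residue frequencies pooled over the whole library, which the abstract does not spell out; the function names `average_composition` and `random_library` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def average_composition(proteins):
    """Pool residue counts over all sequences and return their frequencies.

    Assumption: one library-wide composition is used for every random sequence.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def random_library(proteins, seed=0):
    """Replace each protein with a random sequence of the same length.

    Residues are drawn independently, weighted by the pooled composition.
    """
    rng = random.Random(seed)
    freqs = average_composition(proteins)
    residues = sorted(freqs)
    weights = [freqs[aa] for aa in residues]
    return [
        "".join(rng.choices(residues, weights=weights, k=len(seq)))
        for seq in proteins
    ]


# Toy stand-in for the KGMV library (made-up sequences, not real data).
toy_library = ["MKTAYIAK", "GGSLLVV", "MAAA"]
control = random_library(toy_library, seed=1)
print([len(s) for s in control])
```

Each control sequence keeps the length of the protein it replaces, so an all-against-all comparison of the control set is directly comparable in size to the comparison of the real library.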


Conflict of interest statement

The author declares no conflict of interest.

Figures

Fig. 1.
Percentage of length of proteins in longest matches at two criteria. The y axis represents the percent of length in matches. The x axis represents the percentage of the KGMV library. The lower curve expectation is ≤10⁻³. The upper curve expectation is ≤2. The maximum length matched is plotted for each protein, ordered by percentage of length matched, forming a continuous curve because so many thousands of proteins are plotted.
Fig. 2.
The percentage of each of the proteins included in all matches at expectation two or less. The y axis represents the percentage of length of each protein. The x axis represents the percentage of the KGMV library. The lower curve (copied from Fig. 1 for comparison) is the percentage of length covered in single matches. The upper curve is the percentage of length covered in all matches. Each curve is a plot of all of the proteins that are matched ordered independently by the percentage of length matched. For the upper curve, an array was made for all the amino acids in the probe, and each amino acid was marked if newly included in a match. The percentage of length matched is the sum of all marked amino acids times 100 divided by the length of the protein.
Fig. 3.
The positions of the alignments with protein (EDD1), NP056986, NM_015902. The individual alignments with this probe were scanned, and any match that included alignment with amino acids not previously matched was plotted, starting at the bottom. There are 34 such matches, and, in 5 cases, the same matching protein was included more than once because the alignments reported by BLASTp were significantly different. The heavy lines are matches with an expectation of 10⁻³ or less. The lines of the next weight have an expectation of one or less and >10⁻³. The two thin lines at the top have an expectation of two or less and greater than one.
Fig. 4.
Precision of match at open criterion. The x axis represents the percentage of the KGMV library. On the y axis, the upper curves represent the percentage of protein length matched, and for the lower curve, the scale represents the percentage of amino acids matched averaged for 100 proteins each. The proteins have been collected in sets of 100 each to reduce scatter. Other than this exception, the upper curves are identical to those in Fig. 2. The reason for the curious shape at the beginning is that the UNIX sort program ordered on the basis of percent amino acid match all those that were matched for 100% of their length. Except for a few with high percentage length matched, the average percentage amino acid match is ≈32%.
Fig. 5.
The percentage of random proteins included in all matches; control for open criterion. The description of this figure is exactly as for Fig. 2, except that the lower curve is from an all-to-all comparison of a 13,298-member random amino acid library matching the KGMV library in length and composition (on average). In this example, there were 22,340 matches among the random amino acid sequences at expectation two or less, whereas there were 5,200,000 matches for the upper curve.
Fig. 6.
Coverage of individual amino acids of probes in the many matches. The horizontal scale is the percentage of the KGMV library. The upper curve is identical to the upper curve of Fig. 2, and for this curve the vertical scale is the percentage of the length covered. The lower heavy curve describes the individual amino acids covered, and the right-hand scale for this curve is the percentage of individual amino acids included in the many matches.
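The two length measures used in the captions of Figs. 1 and 2 can be sketched as follows. This is an illustrative reading of the captions, not the author's code: matches are assumed to be given as 1-based inclusive (start, end) positions on the probe protein, and the function names are hypothetical.

```python
def coverage_percent(protein_length, matches):
    """Percentage of the probe covered by all matches combined (Fig. 2, upper curve).

    One flag per amino acid; a residue is marked the first time any match
    includes it, so overlapping matches are not counted twice.
    """
    covered = [False] * protein_length
    for start, end in matches:
        for pos in range(start - 1, end):  # convert 1-based inclusive to 0-based
            covered[pos] = True
    return 100.0 * sum(covered) / protein_length


def longest_match_percent(protein_length, matches):
    """Percentage of the probe covered by its single longest match (Fig. 1)."""
    longest = max((end - start + 1 for start, end in matches), default=0)
    return 100.0 * longest / protein_length


# Made-up example: a 200-residue probe with three matches, two of them overlapping.
example_matches = [(1, 40), (30, 60), (90, 100)]
print(coverage_percent(200, example_matches))       # 35.5
print(longest_match_percent(200, example_matches))  # 20.0
```

In the example, the first two matches together cover residues 1 to 60 and the third adds 11 more, giving 71 marked residues out of 200. The longest single match spans only 40 residues, which is why the all-matches curve lies above the single-match curve.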

