A genome-wide survey of human pseudogenes

David Torrents¹, Mikita Suyama, Evgeny Zdobnov, Peer Bork

PMID: 14656963
PMCID: PMC403797
DOI: 10.1101/gr.1455503

David Torrents et al. Genome Res. 2003 Dec.

Genome Res. 2003 Dec;13(12):2559-67.

Affiliation

¹ EMBL, Heidelberg 69117, Germany.

Abstract

We screened all intergenic regions in the human genome to identify pseudogenes with a combination of homology searches and a functionality test using the ratio of replacement to silent nucleotide substitutions (K_A/K_S). We identified 19,724 regions of which 95% ± 3% are estimated to evolve neutrally and thus are likely to encode pseudogenes. Half of these have no detectable truncation in their pseudocoding regions and therefore are not identifiable by methods that require the presence of truncations to prove nonfunctionality. A comparative analysis with the mouse genome showed that 70% of these pseudogenes have a retrotranspositional origin (processed), and the rest arose by segmental duplication (nonprocessed). Although the spread of both types of pseudogenes correlates with chromosome size, nonprocessed pseudogenes appear to be enriched in regions with high gene density. It is likely that the human pseudogenes identified here represent only a small fraction of the total, which probably exceeds the number of genes.
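The idea behind the functionality test can be illustrated with a minimal sketch. It assumes that K_A (replacement) and K_S (silent) substitution rates have already been estimated for a sequence pair by some external method; the function name `log_ka_ks` and the numeric rates are hypothetical and are not taken from the paper. A ratio near 1 (log near 0) is what neutral evolution predicts, whereas a ratio well below 1 suggests that replacement changes are being removed by selection.

```python
import math


def log_ka_ks(ka, ks):
    """Return log10(K_A / K_S), or None when the ratio is undefined."""
    # The logarithm needs both rates to be strictly positive.
    if ka <= 0 or ks <= 0:
        return None
    return math.log10(ka / ks)


# Hypothetical rates for illustration only.
constrained = log_ka_ks(0.02, 0.20)  # ratio 0.1: far fewer replacement changes
neutral = log_ka_ks(0.18, 0.20)      # ratio 0.9: close to neutral expectation
undefined = log_ka_ks(0.10, 0.0)     # no silent substitutions observed

print(constrained, neutral, undefined)
```

This sketch sets no cutoff between functional and pseudogenic sequences; the survey evaluates the whole K_A/K_S distribution against benchmark sets rather than classifying single values.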

Figures

**Figure 1**
General overview of the strategy for pseudogene search and evaluation. Our analysis can be divided into three different parts: homology search, analysis of orthology for the selection of K_A/K_S benchmark sets, and the functionality test based on K_A/K_S. Green, red, and blue boxes denote the intermediate steps, the excluded sequences, and the final results for each of the sections, respectively. See text for details.

**Figure 2**
K_A/K_S distributions of benchmark and candidate sets. The K_A/K_S distributions (as log K_A/K_S) associated with the functional (green) and pseudogenic (red) benchmark sets (A) as well as the test sequence set (B) are shown. An average of 40% of the sequences analyzed in this study satisfied our requirements for the K_A/K_S calculation. The subsets of sequences with K_A/K_S values (1659 for the functional, 1703 for the pseudogenic benchmark sets, and 3291 for the test set) are expected to be representative for each of the corresponding complete sets, as what determines whether a K_A/K_S value can be calculated for a sequence (availability of homologous sequences and restrictions on the K_A/K_S calculation; see Methods) is likely to equally affect genes and pseudogenes. By using the least-squares fitting against the benchmark distributions, we evaluated the fraction of pseudogenic (red) and functional (green) sequences for each of the bins of the test distribution and combined them to determine that up to 95% of the sequences analyzed correspond to pseudogenes.
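The least-squares step described in this caption can be sketched as a two-component mixture fit. The version below is a simplified illustration, not the authors' procedure: it fits a single global fraction rather than evaluating each bin separately, and the function name `mixture_fraction` and the toy histograms are assumptions made for the example.

```python
def mixture_fraction(test, pseudo, func):
    """Least-squares estimate of f in: test ~ f * pseudo + (1 - f) * func."""
    if not (len(test) == len(pseudo) == len(func)):
        raise ValueError("all histograms must use the same bins")
    # Rearranged model: (test - func) ~ f * (pseudo - func), a one-parameter fit.
    diffs = [p - g for p, g in zip(pseudo, func)]
    denom = sum(d * d for d in diffs)
    if denom == 0:
        raise ValueError("benchmark distributions are identical")
    numer = sum((t - g) * d for t, g, d in zip(test, func, diffs))
    # Clamp to the valid range for a proportion.
    return min(1.0, max(0.0, numer / denom))


# Toy normalized histograms over four log(K_A/K_S) bins (made-up numbers).
functional = [0.50, 0.30, 0.15, 0.05]
pseudogenic = [0.05, 0.15, 0.30, 0.50]
# Build a test histogram that is exactly 90% pseudogenic and 10% functional.
test_hist = [0.9 * p + 0.1 * g for p, g in zip(pseudogenic, functional)]

fraction = mixture_fraction(test_hist, pseudogenic, functional)
print(round(fraction, 3))
```

Because the toy test histogram is constructed as an exact mixture, the fit recovers the mixing fraction used to build it. With real data the residual would be nonzero, and an uncertainty estimate (such as the ± 3% quoted in the abstract) would need a separate procedure.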

**Figure 3**
Distribution of genes and the different types of pseudogenes for each of the human chromosomes. We have displayed for each human chromosome the number of pseudogenes (separated in different types; see chart legend for details) and genes per megabase. Chromosomes have been ordered according to the density of pseudogenes (highest on *top*).

WEB SITE REFERENCES

1. ftp://ftp.ncbi.nih.gov/genomes/; NCBI.
1. http://www.bork.embl-heidelberg.de/Docu/Human_Pseudogenes/; authors' Web site.
