Comparative Study

Nucleic Acids Res. 2002 Jun 1;30(11):2515-23.

doi: 10.1093/nar/30.11.2515.

Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes

Nathaniel Echols¹, Paul Harrison, Suganthi Balasubramanian, Nicholas M Luscombe, Paul Bertone, Zhaolei Zhang, Mark Gerstein

PMID: 12034841
PMCID: PMC117176
DOI: 10.1093/nar/30.11.2515

Nathaniel Echols et al. Nucleic Acids Res. 2002.

Affiliation

¹ Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, Box 208114, New Haven, CT 06520-8114, USA.

Abstract

Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes: the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases, it is most evident for amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes applies even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into 'ancient' and 'modern' subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a sequence composition much closer to that of genes than ancient pseudogenes do. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations. Our compositional analyses with the interactive viewer are available over the web at http://genecensus.org/pseudogene.
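
The notion of an "intermediate" composition can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' pipeline: the function names (`composition`, `intermediacy_index`, `rank_by_difference`) are hypothetical, and the per-residue index, 0 at the gene frequency and 1 at the intergenic frequency, is an assumed way of quantifying intermediacy that the abstract does not specify.

```python
from collections import Counter

# The 20 amino acids plus the stop signal (*), as in the figure captions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY*"


def composition(sequences):
    """Return the fractional frequency of each residue over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no recognised residues in input")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def intermediacy_index(gene, pseudo, intergenic):
    """Place each pseudogene frequency on a gene (0) to intergenic (1) scale.

    Assumed measure: (F_pseudo - F_gene) / (F_intergenic - F_gene).
    Residues with identical gene and intergenic frequencies are skipped,
    since the scale is undefined for them.
    """
    index = {}
    for aa in AMINO_ACIDS:
        span = intergenic.get(aa, 0.0) - gene.get(aa, 0.0)
        if span != 0:
            index[aa] = (pseudo.get(aa, 0.0) - gene.get(aa, 0.0)) / span
    return index


def rank_by_difference(gene, intergenic):
    """Order residues by how much genes and intergenic regions differ.

    The absolute difference is an assumption; a signed difference would
    give a different ordering.
    """
    return sorted(
        AMINO_ACIDS,
        key=lambda aa: abs(gene.get(aa, 0.0) - intergenic.get(aa, 0.0)),
        reverse=True,
    )
```

An index strictly between 0 and 1 for a residue would correspond to the intermediacy described above, and `rank_by_difference` picks out the residues for which the abstract reports the effect to be most evident.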

Figures

**Figure 1**
(A) Gene and (B) intergenic region composition for the 20 amino acids and stop signal (*) in the four eukaryotes. Residues are sorted in decreasing order by standard deviation of gene frequencies across the organisms. Human genes are taken from GenomeScan predictions along chromosomes 21 and 22; for other organisms the available complete proteomes have been used. Some gene sequences may include the terminating stop codon, thus there is some variation in the frequency shown for this signal.

**Figure 2**
Composition of pseudogenes (ΨG) in the eukaryotes. The amino acid content of pseudogene predictions is compared with the implied translation of unmasked chromosomes and identified genes. For the human, only chromosomes 21 and 22 are used in the plot shown. In each case, residues have been sorted in order of the difference in frequency between genes and chromosomes [ΔF(genes,intergenic)].

**Figure 3**
Classifications of pseudogenes. Residues are sorted as above by ΔF(genes,intergenic). (A) Pseudogenes divided into putative processed and duplicated sets. (B) Processed pseudogenes divided into recent and ancient sets based on a median FASTA identity value of 79%.
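
A median-based split like the one in panel (B) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_median_identity` and the input mapping are hypothetical, and assigning pseudogenes that fall exactly on the median to the recent set is an assumption, since the caption does not say how ties are handled.

```python
from statistics import median


def split_by_median_identity(identities):
    """Split pseudogenes into recent and ancient sets at the median identity.

    identities: mapping of pseudogene name -> percent identity to its
    closest matching homolog (hypothetical input format).
    Returns (cutoff, recent_names, ancient_names).
    """
    if not identities:
        raise ValueError("no identity values supplied")
    cutoff = median(identities.values())
    # Assumption: values equal to the median count as recent.
    recent = {name for name, pct in identities.items() if pct >= cutoff}
    ancient = set(identities) - recent
    return cutoff, recent, ancient
```

With real data the cutoff is whatever the median of the supplied identities turns out to be; the caption reports 79% for the processed pseudogenes studied.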

**Figure 4**
Arginine codon bias in human chromosomes 21 and 22. Frequency is out of all Arg codons, or F_codon/F_Arg.
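
The ratio F_codon/F_Arg can be computed from an in-frame coding sequence as sketched below. The function name `arg_codon_fractions` is hypothetical, and the sketch assumes the sequence starts in frame and uses the standard genetic code, in which arginine is encoded by six codons.

```python
from collections import Counter

# The six arginine codons of the standard genetic code.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def arg_codon_fractions(cds):
    """Return each Arg codon's share of all Arg codons in an in-frame CDS."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    # Count complete codons only; a trailing partial codon is ignored.
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total_arg = sum(counts[codon] for codon in ARG_CODONS)
    if total_arg == 0:
        return {codon: 0.0 for codon in ARG_CODONS}
    return {codon: counts[codon] / total_arg for codon in ARG_CODONS}
```

Because the denominator is the number of Arg codons rather than all codons, the six values sum to 1 whenever at least one arginine is present, which separates codon preference from how often arginine itself occurs.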

**Figure 5**
Sample screen of the online composition browser. The database is accessible through a form that allows selection of any combination of features for which amino acid composition has been determined. Included in the display is a plot of the compositions and statistics for each feature.
