Comparative Study
Nucleic Acids Res. 2002 Jun 1;30(11):2515-23.
doi: 10.1093/nar/30.11.2515.

Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes

Nathaniel Echols et al. Nucleic Acids Res.

Abstract

Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes: worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases it is most evident for the amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes holds even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into 'ancient' and 'modern' subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a sequence composition much closer to that of genes than ancient pseudogenes do. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards that of non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations. Our compositional analyses, with an interactive viewer, are available over the web at http://genecensus.org/pseudogene.
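The notion of an "intermediate" composition can be made concrete with a small calculation. The following is a minimal sketch, not the authors' pipeline: it assumes sequences are already translated to one-letter amino acid codes, and the function names `composition` and `intermediacy`, the 0-to-1 score, and the toy sequences are all hypothetical choices made for illustration.

```python
from collections import Counter

# 20 standard amino acids plus the stop signal (*), as used in the figures.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"


def composition(sequences):
    """Return the fraction of each symbol across a set of translated sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in SYMBOLS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no recognizable residues in input")
    return {s: counts[s] / total for s in SYMBOLS}


def intermediacy(genes, pseudogenes, intergenic):
    """Place each pseudogene frequency on the gene-to-intergenic axis.

    0.0 means gene-like and 1.0 means intergenic-like; values strictly
    between the two are intermediate. Symbols whose gene and intergenic
    frequencies are equal are skipped, since the axis is undefined there.
    """
    result = {}
    for s in SYMBOLS:
        span = intergenic[s] - genes[s]
        if span != 0:
            result[s] = (pseudogenes[s] - genes[s]) / span
    return result


# Toy sequences, invented purely for illustration.
genes = composition(["AAAK"])
pseudo = composition(["AAKK"])
inter = composition(["AKKK"])
scores = intermediacy(genes, pseudo, inter)
print(scores)
```

In this toy case both A and K land halfway between the gene and intergenic values. With real data, a score outside the 0-to-1 range would indicate a residue for which the pseudogene set is not intermediate.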


Figures

Figure 1
(A) Gene and (B) intergenic region composition for the 20 amino acids and stop signal (*) in the four eukaryotes. Residues are sorted in decreasing order by standard deviation of gene frequencies across the organisms. Human genes are taken from GenomeScan predictions along chromosomes 21 and 22; for other organisms the available complete proteomes have been used. Some gene sequences may include the terminating stop codon, thus there is some variation in the frequency shown for this signal.
Figure 2
Composition of pseudogenes (ΨG) in the eukaryotes. The amino acid content of pseudogene predictions is compared with the implied translation of unmasked chromosomes and identified genes. For the human, only chromosomes 21 and 22 are used in the plot shown. In each case, residues have been sorted in order of the difference in frequency between genes and chromosomes [ΔF(genes,intergenic)].
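The residue ordering described in this caption can be sketched as follows. This is an illustrative assumption rather than the authors' code: the caption does not say whether the difference is signed or absolute, so the sketch uses the signed difference (gene frequency minus intergenic frequency) in decreasing order. The helper name `sort_by_delta_f` and the frequencies are invented.

```python
def sort_by_delta_f(gene_freq, intergenic_freq):
    """Order residues by signed ΔF = F(genes) - F(intergenic), largest first.

    Assumption: signed difference, descending. The caption does not
    specify the sign convention.
    """
    delta = {r: gene_freq[r] - intergenic_freq.get(r, 0.0) for r in gene_freq}
    order = sorted(delta, key=lambda r: delta[r], reverse=True)
    return order, delta


# Hypothetical frequencies, not taken from the paper.
gene_freq = {"E": 0.07, "K": 0.06, "L": 0.09, "F": 0.04}
intergenic_freq = {"E": 0.03, "K": 0.07, "L": 0.11, "F": 0.07}

order, delta = sort_by_delta_f(gene_freq, intergenic_freq)
print(order)
```

Residues at either end of this ordering are the ones that differ most between genes and intergenic regions, which is where the abstract reports the intermediacy of pseudogenes to be most evident.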
Figure 3
Classifications of pseudogenes. Residues are sorted as above by ΔF(genes,intergenic). (A) Pseudogenes divided into putative processed and duplicated sets. (B) Processed pseudogenes divided into recent and ancient sets based on a median FASTA identity value of 79%.
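The median-based split in panel (B) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each pseudogene is represented as a (name, percent identity) pair, ties at the median are assigned to the recent set, and the names and identity values are invented. Only the use of a median identity cutoff comes from the caption.

```python
from statistics import median


def split_by_median_identity(pseudogenes):
    """Split (name, percent_identity) pairs at the median identity.

    Assumption: entries exactly at the median count as 'recent'.
    """
    cutoff = median(identity for _, identity in pseudogenes)
    recent = [name for name, identity in pseudogenes if identity >= cutoff]
    ancient = [name for name, identity in pseudogenes if identity < cutoff]
    return cutoff, recent, ancient


# Hypothetical identity values, chosen so that the median is 79.
toy = [("psg1", 62.0), ("psg2", 79.0), ("psg3", 91.0),
       ("psg4", 70.0), ("psg5", 88.0)]

cutoff, recent, ancient = split_by_median_identity(toy)
print(cutoff, recent, ancient)
```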
Figure 4
Arginine codon bias in human chromosomes 21 and 22. Frequency is out of all Arg codons, or F_codon/F_Arg.
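The ratio of each codon's frequency to the total arginine codon frequency can be computed as sketched below. The six arginine codons are those of the standard genetic code. Everything else is an assumption made for illustration: the function name `arg_codon_usage`, reading sequences in frame from the first base, and the toy sequence.

```python
from collections import Counter

# The six arginine codons in the standard genetic code.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def arg_codon_usage(coding_sequences):
    """Return each Arg codon's share of all Arg codons.

    Assumption: every sequence is read in frame starting at its first
    base; trailing bases that do not fill a codon are ignored.
    """
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in ARG_CODONS:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no arginine codons found")
    return {codon: counts[codon] / total for codon in ARG_CODONS}


# Toy sequence (invented): codons are CGT, CGT, AGA, AAA.
usage = arg_codon_usage(["CGTCGTAGAAAA"])
print(usage)
```

Normalizing by the arginine total, rather than by all codons, separates codon preference from how often arginine itself occurs, so gene, pseudogene and intergenic sets can be compared on the same scale.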
Figure 5
Sample screen of the online composition browser. The database is accessible through a form that allows selection of any combination of features for which amino acid composition has been determined. Included in the display is a plot of the compositions and statistics for each feature.

