PLoS One. 2009 Jul 30;4(7):e6456.

doi: 10.1371/journal.pone.0006456.

The random nature of genome architecture: predicting open reading frame distributions

Michael W McCoy¹, Andrew P Allen, James F Gillooly

PMID: 19649247
PMCID: PMC2714469
DOI: 10.1371/journal.pone.0006456

Michael W McCoy et al. PLoS One. 2009.

Affiliation

¹ Department of Biology, Boston University, Boston, Massachusetts, United States of America. mwmccoy@bu.edu

Abstract

Background: A better understanding of the size and abundance of open reading frames (ORFs) in whole genomes may shed light on the factors that control genome complexity. Here we examine the statistical distributions of ORFs (i.e., the distribution of start and stop codons) in the fully sequenced genomes of 297 prokaryotes and 14 eukaryotes.

Methodology/principal findings: By fitting mixture models to data from whole genome sequences, we show that the size-frequency distributions for ORFs are strikingly similar across prokaryotic and eukaryotic genomes. Moreover, we show that (i) a large fraction (60–80%) of ORF size-frequency distributions can be predicted a priori with a stochastic assembly model based on GC content, and that (ii) size-frequency distributions of the remaining "non-random" ORFs are well fitted by log-normal or gamma distributions and are similar to the size distributions of annotated proteins.

Conclusions/significance: Our findings suggest that stochastic processes have played a primary role in the evolution of genome complexity, and that common processes govern the conservation and loss of functional genomic units in both prokaryotes and eukaryotes.
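The two-component structure described in the abstract can be illustrated with a minimal sketch. It assumes the mixture takes the form of an exponential "random" component with weight p plus a log-normal "non-random" component with weight 1 − p. The symbols p, μ, and σ appear in the figure captions of this record; the name `rate` for the exponential parameter, the helper `integrate`, and all numeric values are hypothetical and chosen only for illustration. Eqs. 1–2 are not reproduced in this record, so this may differ from the authors' exact formulation.

```python
import math


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution with log-scale mean mu and sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, p, rate, mu, sigma):
    """Assumed mixture: fraction p of 'random' ORFs (exponential) plus
    fraction 1 - p of 'non-random' ORFs (log-normal)."""
    return p * exponential_pdf(x, rate) + (1.0 - p) * lognormal_pdf(x, mu, sigma)


def integrate(f, lo, hi, steps=200_000):
    """Midpoint-rule integral of f over [lo, hi] (hypothetical helper)."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# Illustrative parameter values only; these are not fitted values from the paper.
p, rate, mu, sigma = 0.7, 0.05, math.log(300.0), 0.6
total = integrate(lambda x: mixture_pdf(x, p, rate, mu, sigma), 0.0, 20_000.0)
```

With these made-up parameters, `total` should be close to 1, which is a quick sanity check that the weighted components form a proper density. Swapping `lognormal_pdf` for a gamma density would give the second mixture variant mentioned in the abstract.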

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Fits of the two mixture models (Eqs. 1–2) to the genomes of three representative taxa.**
(a) *Escherichia coli*, a prokaryote, (b) *Yarrowia lipolytica*, a unicellular eukaryote, and (c) *Drosophila melanogaster*, a multicellular eukaryote.

**Figure 2. The size distributions of small ORFs in 311 whole genomes of prokaryotes and eukaryotes are consistent with random expectations (each point represents a genome).**
Observed values obtained by fitting the exponential components of the mixture models (Eqs. 1–2) were linearly related to the values expected for a random sequence of a given GC content, with a slope statistically indistinguishable from 1 and an intercept near 0 (P > 0.05, r² = 0.92).
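One way to see how GC content alone can set a random expectation is the sketch below. It assumes nucleotides are drawn independently with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 − GC)/2, and counts codon positions until the first stop codon (TAA, TAG, or TGA). Both function names are hypothetical, and the authors' stochastic assembly model may define ORFs and the expected value differently.

```python
import math


def stop_codon_probability(gc):
    """Probability that a random codon is TAA, TAG, or TGA, assuming
    independent bases with P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    at = (1.0 - gc) / 2.0  # probability of A (and of T)
    g = gc / 2.0           # probability of G (and of C)
    # TAA contributes at^3; TAG and TGA each contribute at^2 * g.
    return at ** 3 + 2.0 * at * at * g


def expected_random_orf_length(gc):
    """Mean number of codon positions up to and including the first stop codon
    (mean of a geometric distribution); infinite if stops cannot occur."""
    p_stop = stop_codon_probability(gc)
    return math.inf if p_stop == 0.0 else 1.0 / p_stop


# At 50% GC every codon is equally likely, so 3 of 64 codons are stops.
p_half = stop_codon_probability(0.5)
mean_half = expected_random_orf_length(0.5)
```

Under these assumptions, a higher GC content lowers the stop-codon probability and so lengthens the expected random ORF, which is the kind of dependence on GC content that the figure compares against fitted values.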

**Figure 3. The relationship between ORFs and genome characteristics.**
Panels a and b show the relationships between the total number of random ORFs and genome size and GC content, respectively. Panels c and d show the relationships between the fraction of all ORFs that are randomly generated (p in Eqs. 1–2) and genome size and GC content, respectively. Data were fitted using generalized additive models with non-parametric smoothing functions. Dashed lines represent 95% pointwise confidence intervals.

**Figure 4. The relationship between parameters estimated from the mixture models and from annotated proteins.**
Panels a and b show the relationships between the parameters μ and σ of the lognormal distribution estimated from the mixture model fits and the μ and σ parameters estimated from fits to annotated proteins. Panels c and d show the relationships between the parameters α and β of the gamma distribution estimated from the mixture model fits and the α and β parameters estimated from fits to annotated proteins. Data were fitted using analysis of covariance.

**Figure 5. Relationship between the number of non-random ORFs estimated from the mixture model fits and the number of annotated proteins for 311 prokaryotic and eukaryotic genomes.**
Best-fit lines were determined by analysis of covariance; the dashed line represents the fit for prokaryotes, and the solid line represents the fit for eukaryotes.

Cited by

Genome sizes and the Benford distribution.
Friar JL, Goldman T, Pérez-Mercader J. PLoS One. 2012;7(5):e36624. doi: 10.1371/journal.pone.0036624. Epub 2012 May 18. PMID: 22629319.
Predicting statistical properties of open reading frames in bacterial genomes.
Mir K, Neuhaus K, Scherer S, Bossert M, Schober S. PLoS One. 2012;7(9):e45103. doi: 10.1371/journal.pone.0045103. Epub 2012 Sep 24. PMID: 23028785.
Alu distribution and mutation types of cancer genes.
Zhang W, Edwards A, Fan W, Deininger P, Zhang K. BMC Genomics. 2011 Mar 23;12:157. doi: 10.1186/1471-2164-12-157. PMID: 21429208.
