PLoS One. 2009 Jul 30;4(7):e6456.
doi: 10.1371/journal.pone.0006456.

The random nature of genome architecture: predicting open reading frame distributions

Michael W McCoy et al. PLoS One. 2009.

Abstract

Background: A better understanding of the size and abundance of open reading frames (ORFs) in whole genomes may shed light on the factors that control genome complexity. Here we examine the statistical distributions of ORFs (i.e., the distribution of start and stop codons) in the fully sequenced genomes of 297 prokaryotes and 14 eukaryotes.

Methodology/principal findings: By fitting mixture models to data from whole genome sequences, we show that the size-frequency distributions of ORFs are strikingly similar across prokaryotic and eukaryotic genomes. Moreover, we show that (i) a large fraction (60-80%) of ORF size-frequency distributions can be predicted a priori with a stochastic assembly model based on GC content, and that (ii) size-frequency distributions of the remaining "non-random" ORFs are well fitted by log-normal or gamma distributions and are similar to the size distributions of annotated proteins.
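As an illustration of the kind of mixture model described above, the following is a minimal sketch, not the authors' Eqs. 1–2 (which are not reproduced on this page). It assumes a two-component density in which a fraction p of ORFs follows an exponential distribution (the "random" component) and the remainder follows a log-normal distribution. The function names and the parameter values in the usage note are hypothetical.

```python
import math


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution with log-scale mean mu and sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, p, rate, mu, sigma):
    """Assumed mixture: p * exponential + (1 - p) * log-normal."""
    return p * exponential_pdf(x, rate) + (1.0 - p) * lognormal_pdf(x, mu, sigma)


def log_likelihood(lengths, p, rate, mu, sigma):
    """Log-likelihood of positive ORF lengths under the assumed mixture."""
    return sum(math.log(mixture_pdf(x, p, rate, mu, sigma)) for x in lengths)


def random_component_posterior(x, p, rate, mu, sigma):
    """Probability that an ORF of length x belongs to the exponential component."""
    return p * exponential_pdf(x, rate) / mixture_pdf(x, p, rate, mu, sigma)
```

With illustrative values such as p = 0.7, rate = 0.05, mu = 5.5 and sigma = 0.6, a short ORF (20 codons) is assigned almost entirely to the exponential component, while a long one (400 codons) is assigned almost entirely to the log-normal component. Fitting would amount to maximizing `log_likelihood` over the four parameters with an optimizer of choice; a gamma density could be substituted for the log-normal in the same way.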

Conclusions/significance: Our findings suggest that stochastic processes have played a primary role in the evolution of genome complexity, and that common processes govern the conservation and loss of functional genomic units in both prokaryotes and eukaryotes.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Fits of the two mixture models (Eqs. 1–2) to the genomes of three representative taxa.
(a) Escherichia coli, a prokaryote, (b) Yarrowia lipolytica, a unicellular eukaryote, and (c) Drosophila melanogaster, a multicellular eukaryote.

Figure 2. The size distributions of small ORFs in 311 whole genomes of prokaryotes and eukaryotes are consistent with random expectations (each point represents a genome).
Observed values obtained by fitting the exponential components of the mixture models ([formula image] in Eqs. 1–2) were linearly related to the expected value for a random sequence of a given GC content, [formula image], with a slope statistically indistinguishable from 1 and an intercept near 0 (P > 0.05, r² = 0.92).
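The expectation for a random sequence of a given GC content can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the authors' model: bases are drawn independently, P(G) = P(C) = GC/2 and P(A) = P(T) = (1 − GC)/2, and the stop codons are TAA, TAG and TGA. Under these assumptions, the number of codons until the first stop is geometric, which is closely approximated by an exponential distribution. The function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Probability that a random codon is TAA, TAG or TGA at GC fraction gc."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must lie between 0 and 1")
    at = (1.0 - gc) / 2.0  # P(A) = P(T), assuming A and T are equally frequent
    g = gc / 2.0           # P(G) = P(C), assuming G and C are equally frequent
    # TAA contributes at^3; TAG and TGA each contribute at^2 * g.
    return at ** 3 + 2.0 * at ** 2 * g


def expected_codons_to_stop(gc):
    """Mean number of codons up to and including the first stop (geometric mean)."""
    p_stop = stop_codon_probability(gc)
    if p_stop == 0.0:
        return float("inf")  # no stop codon can occur when gc == 1
    return 1.0 / p_stop
```

At GC = 0.5 the stop probability is 3/64, so a random reading frame runs about 21 codons on average before a stop. GC-rich sequences have fewer stop codons and therefore longer random ORFs, which is the qualitative dependence on GC content that the figure relies on.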

Figure 3. The relationship between ORFs and genome characteristics.
Panels a and b show the relationships between the total number of random ORFs and genome size and GC content. Panels c and d show the relationships between the fraction of all ORFs that are randomly generated (p in Eqs. 1–2) and genome size and GC content. Data were fitted using generalized additive models with non-parametric smoothing functions. Dashed lines represent 95% pointwise confidence intervals.

Figure 4. The relationship between parameters estimated from the mixture models and from annotated proteins.
Panels a and b show the relationships between the parameters μ and σ of the lognormal distribution estimated from the mixture model fits and the μ and σ parameters estimated from fits to annotated proteins. Panels c and d show the relationships between the parameters α and β of the gamma distribution estimated from the mixture model fits and the α and β parameters estimated from fits to annotated proteins. Data were fitted using analysis of covariance.

Figure 5. Relationship between the number of non-random ORFs estimated from the mixture model fits and the number of annotated proteins for 311 prokaryotic and eukaryotic genomes.
Best-fit lines were determined by analysis of covariance; the dashed line represents the fit for prokaryotes and the solid line the fit for eukaryotes.
