Genome Res. 2011 Nov;21(11):1916-28.
doi: 10.1101/gr.108753.110. Epub 2011 Oct 12.

Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes

Michael F Lin et al. Genome Res. 2011 Nov.

Abstract

The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes--especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ∼2% of their synonymous sites. We collect numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape.
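As a rough illustration of the nine-codon window resolution mentioned in the abstract, the following sketch enumerates fixed-size codon windows along an ORF. The function name `codon_windows`, the one-codon step, and the 1-based start positions are assumptions made for this example; the abstract does not specify how windows are stepped or indexed.

```python
def codon_windows(orf, size=9, step=1):
    """Return (start_codon, codons) pairs for fixed-size windows along an ORF.

    Hypothetical helper for illustration only. `start_codon` is 1-based,
    and the default one-codon step is an assumption of this sketch.
    """
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Split the nucleotide string into consecutive codons.
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    windows = []
    # Slide a window of `size` codons along the ORF.
    for start in range(0, len(codons) - size + 1, step):
        windows.append((start + 1, codons[start:start + size]))
    return windows


# Toy 12-codon sequence (not a real gene).
toy_orf = "ATG" + "GCT" * 10 + "TAA"
toy_windows = codon_windows(toy_orf)
```

Each window could then be passed to whatever codon-model rate estimator is in use; that estimation step is outside the scope of this sketch.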

Figures

Figure 1.
(A) Examples of local synonymous rate variation in alignments of 29 placental mammals for short nine-codon windows within the open reading frames (ORFs) of three known human protein-coding genes—ALDH2, BMP4, and GRIA2—with brackets denoting starting codon position within each ORF of shown alignment. (Bright green) Synonymous substitutions with respect to the inferred ancestral sequence; (dark green) conservative amino acid substitutions; (red) other nonsynonymous substitutions. The estimated parameter λsome denotes the rate of synonymous substitution within these selected windows relative to genome-wide averages. For example, the nine-codon window starting at codon 88 of the BMP4 ORF shows λsome = 0.5, corresponding to an estimated synonymous substitution rate 50% below the genome average. (B) Variation in the estimated synonymous rate at different positions with respect to exon boundaries and translation start and stop, across all CCDS ORFs. For each class of regions, box-and-whisker plots show the observed distribution of λsome, including the median (middle horizontal bars), middle 50% range (boxes), extreme values (whiskers), and whether medians differ with high statistical confidence (nonoverlapping notches between two boxes). Estimated synonymous rates tend to be significantly reduced at the 5′ and 3′ ends of exons, and dramatically reduced in alternatively spliced exons, likely reflecting widespread splicing regulatory elements embedded within protein-coding regions.
Figure 2.
Identifying individual windows with statistically significant synonymous constraint. (A) Estimated synonymous rate relative to genome average (λsome) and corresponding P-value for the hypothesis λsome < 1 evaluated in nine-codon windows along the entire protein-coding regions of ALDH2, BMP4, and GRIA2, highlighting the windows corresponding to the three examples in Figure 1. For each plot, the top portion shows the λsome estimate for each window (black curve), the genome average (red line at λs = 1), and the ORF average (blue dashed line). The bottom portion shows the statistical significance of the reduction in the synonymous rate estimate in each window, accounting for evidence in the cross-species alignments, using a likelihood ratio test for the hypothesis λsome < 1 (continuous black curve, using the genome average as the null model), and for the hypothesis λsORF < 1 (dashed black curve, using the ORF average as the null model). (Vertical gray lines) Exon boundaries; (orange) regions where λsome drops below 1/16th toward the 5′ end of BMP4 and the 3′ end of GRIA2. (B) Overall distribution of λsome estimates for all nine-codon windows across all CCDS genes. Heavy left tail indicates an excess of windows with very low estimated synonymous rates, shifting the mean (λsome = 1) to the left of the distribution mode, which likely represents neutral rates. (C) Comparison of synonymous rates estimated relative to genome-wide (λsome) and ORF-specific (λsORF) null models, each point denoting one nine-codon window, and density of overlapping points denoted by color. Joint distribution shows that low λsome estimates also usually correspond to low λsORF estimates, and therefore that the heavy tail observed in B does not reflect regional or ORF-wide deceleration, but instead localized constraints in small windows within each ORF, also visible in the three examples of A. 
(D) Comparison of P-values for synonymous rate reduction with respect to genome-wide (y-axis) and ORF-specific (x-axis) null models. Candidate synonymous constraint windows are selected when synonymous rate reductions are significant at P < 0.01 with respect to both null models (orange lines). Note that many windows are significant with respect to one null model but not the other. (E) Correspondence between λsome and the associated significance estimate for each nine-codon window. The visible stripes in this plot arise from windows that are perfectly conserved except for one, two, three, or more synonymous substitutions observed in the extant species, while the position along each stripe reflects variation in the λsome estimate and its significance, determined by the species coverage, codon composition, and observed codon substitutions in each window. (B–E) The three example regions highlighted in A are shown in each distribution and density plot, with horizontal and vertical axes aligned. The orange line in plots A, D, and E denotes the statistical significance cutoff of P < 0.01, and the red line in plots A, B, C, and E denotes the genome-wide average λsome = 1 and λsORF = 1 for B. The ALDH2[103] synonymous rate is not significantly reduced either relative to the genome or to the ALDH2 ORF; BMP4[88] is reduced relative to the genome but not relative to its ORF, which shows an overall reduced rate; GRIA2[586] is >80% reduced relative to both the genome and its ORF, resulting in significant P-values for both.
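The selection rule described in panel D (significance at P < 0.01 under both the genome-wide and the ORF-specific null model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary keys, and the use of a chi-square reference distribution with one degree of freedom for the likelihood ratio statistic are all assumptions, and the example P-values are placeholders rather than results from the paper.

```python
import math


def lrt_p_value(loglik_null, loglik_alt):
    """P-value for a likelihood ratio test against a chi-square(1) reference.

    Assumption of this sketch: the two models differ by one free parameter,
    so the statistic 2 * (logL_alt - logL_null) is compared to chi-square
    with 1 degree of freedom, whose upper tail is erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return math.erfc(math.sqrt(stat / 2.0))


def select_candidates(windows, alpha=0.01):
    """Keep windows that are significant under BOTH null models."""
    return [
        w for w in windows
        if w["p_genome"] < alpha and w["p_orf"] < alpha
    ]


# Placeholder windows with made-up P-values (hypothetical, for illustration).
example_windows = [
    {"name": "A", "p_genome": 0.40, "p_orf": 0.35},    # significant under neither
    {"name": "B", "p_genome": 0.001, "p_orf": 0.20},   # genome-wide null only
    {"name": "C", "p_genome": 1e-6, "p_orf": 1e-4},    # both null models
]
selected = select_candidates(example_windows)
```

Requiring significance under both nulls is what excludes windows such as the placeholder "B", where a low rate relative to the genome reflects an ORF-wide reduction and not a localized constraint.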
Figure 3.
Examples of candidate synonymous constraint elements (SCEs) with likely roles in splicing and translation regulation. (A) Predicted SCEs (light blue) overlapping two isoforms of ADAR exon 4 (black) arising from an alternative splice donor site encoded within the longer exon variant. With increasing resolution, the SCE is more precisely localized to the region of overlap with the alternative splice site (motif logo for human donor sites rendered by WebLogo) (Crooks et al. 2004). The localization of the synonymous constraint to the splice site is also seen in the local synonymous rate estimate λsORF (relative to the ORF average). Note that the significant reduction in the synonymous rate is not obvious from the nucleotide-level conservation measure (dark blue, bottom panel). The extent of the predicted SCE may suggest the presence of additional splicing regulatory elements downstream from the alternative splice site. (B) Predicted SCE (light blue) overlapping an alternate translation initiation site (green) in BRCA1 encoded within exon 9 of a longer isoform. Synonymous constraint ranges from shortly upstream to immediately downstream of the alternate start codon, suggesting this region may be involved in regulating translation initiation at the alternate site. The region just upstream of the predicted SCE also shows a reduced synonymous rate (black curve) overlapping an alternative splice donor site for a third BRCA1 isoform (gray), although this reduction is not statistically significant and the third isoform is weakly supported. Annotation visualizations in Figures 3 and 4 are based on the UCSC Genome Browser (Kent et al. 2002).
Figure 4.
Synonymous constraint elements (SCEs) corresponding to dual-coding, selenocysteine insertion, and expression enhancer functions. (A) A large SCE (blue) fully encompasses a 66-codon sense/antisense dual-coding region in the convergent transcripts of THRA and NR1D1. The SCE is specifically localized to the overlapping exons, while upstream exons of each gene are excluded. (B) A predicted SCE in the selenoprotein-encoding gene SEPHS2 encompasses the selenocysteine insertion site (red) and a predicted RNA hairpin structure (minimum free energy fold rendered by VARNA) (Darty et al. 2009) immediately downstream from the selenocysteine codon. Inferred structure is similar to a hairpin known to stimulate selenocysteine recoding in SEPN1 (Howard et al. 2005). (C) Two SCEs are found within the HOXA2 ORF, each corresponding to a different enhancer element regulating expression of the mouse ortholog in distinct segments of the developing hindbrain. The 5′ element encodes a HOX-PBX responsive element and drives expression in rhombomere 4 (Lampe et al. 2008), and the 3′ element encodes SOX2 binding sites and drives expression in rhombomere 2 (Tümpel et al. 2008). The 3′ element includes several RTE and ACAAT motif instances that were investigated by site-directed mutagenesis in the previous study (red), as well as two additional upstream instances (green). SCEs are also found within most other HOX genes.
