Nucleic Acids Res. 2018 Apr 20;46(7):3326-3338.
doi: 10.1093/nar/gky188.

A comprehensive catalog of predicted functional upstream open reading frames in humans

Patrick McGillivray et al. Nucleic Acids Res. 2018.

Abstract

Upstream open reading frames (uORFs) latent in mRNA transcripts are thought to modify translation of coding sequences by altering ribosome activity. Not all uORFs are thought to be active in such a process. To estimate the impact of uORFs on the regulation of translation in humans, we first circumscribed the universe of all possible uORFs based on coding gene sequence motifs and identified 1.3 million unique uORFs. To determine which of these are likely to be biologically relevant, we built a simple Bayesian classifier using 89 attributes of uORFs labeled as active in ribosome profiling experiments. This allowed us to extrapolate to a comprehensive catalog of likely functional uORFs. We validated our predictions using in vivo protein levels and ribosome occupancy from 46 individuals. This is a substantially larger catalog of functional uORFs than has previously been reported. Our ranked list of likely active uORFs allows researchers to test their hypotheses regarding the role of uORFs in health and disease. We demonstrate several examples of biological interest through the application of our catalog to somatic mutations in cancer and disease-associated germline variants in humans.
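The abstract describes the classifier only as "a simple Bayesian classifier" trained on attributes of uORFs labeled as active. As a rough illustration of that idea, the sketch below fits a categorical naive Bayes model with Laplace smoothing and scores a uORF by its log-odds of being positive rather than unlabeled. The naive Bayes form, the smoothing constant, and the two toy attributes (start codon and a "strong"/"weak" context bin) are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model (assumed form, Laplace smoothing).

    rows   : list of tuples of discrete attribute values (hypothetical bins)
    labels : list of class labels, here "positive" or "unlabeled"
    """
    class_counts = Counter(labels)
    n_attrs = len(rows[0])
    # value_counts[label][attribute index][value] -> number of occurrences
    value_counts = {c: [Counter() for _ in range(n_attrs)] for c in class_counts}
    values_seen = [set() for _ in range(n_attrs)]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[label][j][value] += 1
            values_seen[j].add(value)
    return class_counts, value_counts, values_seen, alpha


def log_odds(model, row, pos="positive", neg="unlabeled"):
    """Log-odds that a uORF with these attribute values is positive."""
    class_counts, value_counts, values_seen, alpha = model
    score = math.log(class_counts[pos]) - math.log(class_counts[neg])
    for j, value in enumerate(row):
        k = len(values_seen[j])  # number of distinct values of attribute j
        p_pos = (value_counts[pos][j][value] + alpha) / (class_counts[pos] + alpha * k)
        p_neg = (value_counts[neg][j][value] + alpha) / (class_counts[neg] + alpha * k)
        score += math.log(p_pos) - math.log(p_neg)
    return score


# Toy training data: (start codon, context bin); both attributes are hypothetical.
rows = [("ATG", "strong"), ("ATG", "strong"), ("ATG", "weak"),
        ("CTG", "weak"), ("CTG", "weak"), ("CTG", "strong")]
labels = ["positive"] * 3 + ["unlabeled"] * 3

model = train_naive_bayes(rows, labels)
score_atg = log_odds(model, ("ATG", "strong"))
score_ctg = log_odds(model, ("CTG", "weak"))
print(score_atg, score_ctg)
```

Sorting all candidate uORFs by such a score would give a ranked list in the spirit of the catalog; the real model uses 89 attributes rather than two.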

Figures

Figure 1.
(A) Structure of upstream open reading frames. The stop codon of a uORF may be located before the CDS start codon [top], or downstream of the CDS start codon if the uORF is frame-shifted relative to the CDS [middle]. If the uORF and CDS share the same stop codon, the uORF acts as a 5′ extension of the CDS [bottom]. (B) Effect of mutation or variation on upstream open reading frames. Creation or destruction of an upstream open reading frame may have a downstream effect on translation of the coding sequence. A change in the translation of the coding sequence may result in a change in phenotype and disease risk. (C) Sensitivity and specificity of ribosome profiling for identifying upstream open reading frames. It is possible that ribosome profiling studies have a high false negative rate (left), or a high false positive rate (right). We assume that ribosome profiling studies have a high false negative rate for identifying translated upstream open reading frames (left). (D) The activity of uORFs varies according to cell type and environmental stimuli. uORFs may not be detected in a ribosome profiling experiment due to variation in uORF activity with cell type and cell environment.
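The three uORF structures in panel A can be made concrete with a small scanner. The sketch below walks the 5′ UTR of a single transcript, follows each start codon in frame to the first stop codon, and labels the result by where that stop falls relative to the CDS start. The function name, the labels, the 0-based coordinates, and the toy transcript are all illustrative assumptions; only ATG starts are scanned by default, although near-cognate codons can be passed in.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(mrna, cds_start, start_codons=("ATG",)):
    """Return (start, stop_end, kind) for each uORF that starts in the 5' UTR.

    mrna      : transcript sequence as a DNA-alphabet string
    cds_start : 0-based index of the first base of the CDS start codon
    kind      : "upstream"    - stop codon ends before the CDS start codon
                "overlapping" - out of frame, stop lies downstream of CDS start
                "extension"   - in frame with the CDS and shares its stop codon
    """
    uorfs = []
    # The whole start codon must lie inside the 5' UTR.
    for start in range(cds_start - 2):
        if mrna[start:start + 3] not in start_codons:
            continue
        # Read codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                stop_end = pos + 3
                if stop_end <= cds_start:
                    kind = "upstream"
                elif (cds_start - start) % 3 == 0:
                    kind = "extension"
                else:
                    kind = "overlapping"
                uorfs.append((start, stop_end, kind))
                break
    return uorfs


# Toy transcript (made up): a 21-nt 5' UTR followed by a 15-nt CDS.
utr = "ATGAAATAACCATGCATGGCC"
cds = "ATGGCTGACAAATAA"
uorfs = find_uorfs(utr + cds, cds_start=len(utr))
print(uorfs)
# [(0, 9, 'upstream'), (11, 29, 'overlapping'), (15, 36, 'extension')]
```

A start codon with no in-frame stop before the end of the transcript is skipped here; a genome-wide scan would also need to handle multiple transcripts per gene.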
Figure 2.
(A) Methodology for distinguishing positive from unlabeled uORFs. uORFs identified through genome-wide scan and uORFs labeled in ribosome profiling experiments were used to train a machine learning algorithm to identify uORFs that are likely active (positive predictions). (B) Examples of differential distributions of attributes between positive and unlabeled uORFs. uORF attributes are used to distinguish positive from unlabeled uORFs. Continuous distributions were discretized and optimized for machine learning using the minimum description length principle (MDLP) binning algorithm. Horizontal lines on the plot correspond to these binning intervals. (C) Upstream open reading frame attribute ranking. Attributes are ranked according to the difference in distribution between positive and unlabeled uORFs using the KS statistic. The top 15 features according to this prioritization are shown.
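Panel C ranks attributes by how differently they are distributed in positive and unlabeled uORFs, using the KS (Kolmogorov-Smirnov) statistic. A minimal sketch of that ranking follows: the two-sample KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. The attribute names and values are invented for illustration, and the MDLP binning step mentioned in panel B is not implemented here.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        value = min(xs[i], ys[j])
        # Step both CDFs past every observation equal to the current value.
        while i < len(xs) and xs[i] == value:
            i += 1
        while j < len(ys) and ys[j] == value:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap


# Hypothetical attributes: (values in positive uORFs, values in unlabeled uORFs).
attributes = {
    "context_score": ([0.8, 0.9, 0.7, 0.85], [0.2, 0.3, 0.1, 0.4]),
    "uorf_length": ([30, 60, 90, 120], [30, 60, 90, 150]),
    "distance_to_cds": ([10, 20, 30, 40], [10, 35, 50, 60]),
}

ranking = sorted(
    ((name, ks_statistic(pos, unl)) for name, (pos, unl) in attributes.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, ks in ranking:
    print(f"{name}: KS = {ks:.2f}")
```

Attributes at the top of such a ranking separate the two groups most strongly, which is the sense in which the figure reports the top 15 features.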
Figure 3.
(A) Ribosome profiling identified uORFs as a subset of all uORFs. The universe of all uORFs is identified through comprehensive search of the GENCODE human genome annotation [outer border]. Ribosome profiling studies of Fritsch et al., Lee et al., and Gao et al. are shown as overlapping subsets of this universe. Pair-wise and three-way intersections between these experiments are highlighted. (B) Frequency of translated uORF ATG start codons and near-cognate start codons from ribosome profiling experiments. Frequency for uORFs translated in any experiment (union) or in more than one experiment (intersection). (C) Score distributions for upstream open reading frames. Score distributions for 2-voted positive uORFs that are translated in two or more ribosome profiling experiments (top), 1-voted positive uORFs that are translated in only one ribosome profiling experiment (middle), and unlabeled uORFs uncovered through genome-wide search (bottom). (D) The frequency of uORF ATG start codons and near-cognate start codons of predicted positive upstream open reading frames. Frequency is given for all uORFs genome-wide and for the subset of uORFs that are predicted to be active (predicted positive). (E) uORFs predicted as positive from genome-wide scan and ribosome profiling experiments. Approximately 180,000 uORFs in the genome are predicted as active upstream open reading frames. This large set includes substantial proportions of uORFs identified in the ribosome profiling experiments (∼70% each). (F) Performance of the machine learning algorithm. The machine learning algorithm was trained on two of three ribosome profiling data sets and used to extract the third data set from among unlabeled examples. The ROC curve is shown for each of the three combinations: (i) train Lee et al. and Gao et al., extract Fritsch et al. (AUC = 0.77); (ii) train Fritsch et al. and Gao et al., extract Lee et al. (AUC = 0.82); (iii) train Lee et al. and Fritsch et al., extract Gao et al. (AUC = 0.79).
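The evaluation in panel F holds out one ribosome profiling data set at a time and asks how well a model trained on the other two ranks the held-out uORFs above unlabeled ones. The sketch below shows the two pieces of that procedure that do not depend on the model: enumerating the three train/extract splits, and computing the AUC as the fraction of (held-out, unlabeled) pairs that are ranked correctly. The scores are invented; in practice they would come from the trained classifier.

```python
def roc_auc(held_out_scores, unlabeled_scores):
    """AUC as the probability that a held-out uORF outscores an unlabeled one.

    Ties count as half a correct ranking (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in held_out_scores:
        for u in unlabeled_scores:
            if p > u:
                wins += 1.0
            elif p == u:
                wins += 0.5
    return wins / (len(held_out_scores) * len(unlabeled_scores))


def leave_one_out_splits(datasets):
    """Yield (training sets, held-out set) for every choice of held-out set."""
    for held_out in datasets:
        yield [d for d in datasets if d != held_out], held_out


splits = list(leave_one_out_splits(["Fritsch", "Lee", "Gao"]))
for train, held_out in splits:
    print(f"train on {train}, extract {held_out}")

# Made-up classifier scores for one split.
held_out_scores = [0.9, 0.8, 0.4]
unlabeled_scores = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(held_out_scores, unlabeled_scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of uORFs; for a catalog of this size a rank-based computation would be the practical choice, but the result is the same quantity.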
Figure 4.
(A) Gene level protein expression change for individuals with variants interrupting predicted positive uORFs. The work of Battle et al. includes proteomic measurements for 46 individuals with whole genome variant calling through the 1000 Genomes Project. For these individuals, uORF gain is associated with increased protein levels from the downstream gene, while uORF loss is associated with decreased protein levels. (B) rQTLs interrupting uORFs according to the score of the corresponding uORF. rQTLs identified by Battle et al. display a tendency to hit predicted positive uORFs. (C) Density matrix showing the distribution of 1000 Genomes variants that interrupt predicted positive uORF start codons. The vertical axis displays the reference start codon, and the horizontal axis shows the interrupting variant (position: 1, 2, 3; nucleotide: A, T, G, C). (D) Density matrix showing the distribution of somatic mutations found in exomic tumor samples that interrupt predicted positive uORF start codons. The vertical axis displays the reference start codon, and the horizontal axis shows the interrupting variant (position: 1, 2, 3; nucleotide: A, T, G, C). ATG-forming mutations are highlighted.
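The axes of the density matrices in panels C and D can be read as an enumeration of single-nucleotide substitutions within a start codon: three positions, each with three alternative nucleotides. The sketch below lists those substitutions for a given reference codon and flags the ones that create an ATG, as highlighted in panel D. The function name and tuple layout are illustrative assumptions.

```python
NUCLEOTIDES = "ATGC"


def start_codon_variants(codon):
    """List (position, alt, mutated codon, forms_atg) for each substitution.

    Positions are 1-based to match the figure axis (1, 2, 3).
    """
    variants = []
    for index in range(3):
        for alt in NUCLEOTIDES:
            if alt == codon[index]:
                continue  # same as the reference base, so not a variant
            mutated = codon[:index] + alt + codon[index + 1:]
            variants.append((index + 1, alt, mutated, mutated == "ATG"))
    return variants


ctg_variants = start_codon_variants("CTG")   # a near-cognate start codon
atg_forming = [v for v in ctg_variants if v[3]]
print(atg_forming)
# [(1, 'A', 'ATG', True)]

# Every substitution in a reference ATG moves it away from ATG.
atg_variants = start_codon_variants("ATG")
print(len(atg_variants), any(v[3] for v in atg_variants))
# 9 False
```

Counting observed variants into cells keyed by reference codon (rows) and by position and alternative nucleotide (columns) would reproduce the layout of the matrices.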
