Brief Bioinform. 2022 Jul 18;23(4):bbac210.
doi: 10.1093/bib/bbac210.

Three-nucleotide periodicity of nucleotide diversity in a population enables the identification of open reading frames

Mengyun Jiang et al. Brief Bioinform. 2022.

Abstract

Accurate prediction of open reading frames (ORFs) is important for studying and using genome sequences. Ribosomes move along mRNA strands in steps of three nucleotides, and datasets carrying this information can be used to predict ORFs. Ribosome-protected footprints (RPFs) show a significant 3-nt periodicity on mRNAs and are powerful for predicting actively translated ORFs, including small ORFs (sORFs), but their application is limited because they are too short to be mapped accurately in complex genomes. In this study, we found a significant 3-nt periodicity in population-level genomic variant datasets within coding sequences, in which nucleotide diversity increases every three nucleotides. We suggest that this feature can be used to predict ORFs and have developed the Python package 'OrfPP', which recovers ~83% of the annotated ORFs in the tested genomes on average, independent of population size and genome complexity. The novel ORFs, including sORFs, identified from single-nucleotide polymorphisms (SNPs) are supported by protein mass spectrometry evidence comparable to that of the annotated ORFs. Applying OrfPP to the tetraploid cotton and hexaploid wheat genomes identified 76.17% and 87.43% of the annotated ORFs, respectively, as well as 4704 sORFs (including 1182 upstream and 2110 downstream ORFs) in cotton and 5025 sORFs (including 232 upstream and 234 downstream ORFs) in wheat. Overall, we propose an alternative and supplementary approach for ORF prediction that can extend the study of sORFs to more complex genomes.
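The 3-nt periodicity described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not OrfPP's implementation: the function names, the biallelic diversity estimator, and the choice of the third triplet position as the more variable one are assumptions made for illustration, and a plain FFT periodogram stands in for whatever statistical test the package applies.

```python
import numpy as np

def site_diversity(alt_freq, n_chrom):
    """Per-site diversity for biallelic SNPs: pi = n/(n-1) * 2p(1-p)."""
    p = np.asarray(alt_freq, dtype=float)
    return n_chrom / (n_chrom - 1) * 2.0 * p * (1.0 - p)

def periodogram_peak(signal):
    """Return (frequency, power) of the strongest non-zero frequency."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the zero-frequency component
    power = np.abs(np.fft.rfft(x)) ** 2   # plain (single-taper) periodogram
    freqs = np.fft.rfftfreq(x.size, d=1.0)
    k = 1 + np.argmax(power[1:])          # skip the zero-frequency bin
    return freqs[k], power[k]

# Synthetic allele frequencies for a 300-nt coding region (assumption:
# the third position of each triplet is the more variable one).
rng = np.random.default_rng(0)
n_sites = 300
alt_freq = rng.uniform(0.0, 0.05, size=n_sites)
alt_freq[2::3] += 0.2

pi = site_diversity(alt_freq, n_chrom=200)
freq, _ = periodogram_peak(pi)
phase_means = pi.reshape(-1, 3).mean(axis=0)  # mean diversity per triplet position

print(f"peak frequency: {freq:.3f} cycles/nt")
print("mean diversity by triplet position:", np.round(phase_means, 3))
```

A peak near one cycle per three nucleotides, together with one triplet position carrying clearly higher mean diversity, is the kind of signal the abstract refers to. On real data the reading frame is unknown in advance, so the phase with the highest mean would have to be inferred rather than assumed.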

Keywords: SNPs; open reading frame; polyploidy genome; population; sORF.

Figures

Figure 1
A 3-nt periodicity is shown by nucleotide diversity in coding sequences but not in the other regions in the genomes of (A) fission yeast, (B) Arabidopsis and (C) rice. The periodicity of the nucleotide diversities in each dataset was measured by a ‘multitaper’ test shown on the right, in which a peak at 0.33 (blue dashed lines) indicates a significant (P < 0.001, cyan dashed lines) periodicity of 3-nt. The values from the first, second and third positions in each triplet were colored in cyan, orange and purple, respectively.
Figure 2
The workflow of OrfPP. (A) The 3-nt periodicity shown in the populational nucleotide diversity in CDSs is reminiscent of the periodicity shown in ribosome-protected footprints and (B) the workflow of OrfPP.
Figure 3
Recovery of annotated ORFs by OrfPP using SNP datasets. Comparison between the ORFs predicted from SNPs and those predicted from RPFs in the genomes of (A) fission yeast, (B) Arabidopsis and (C) rice. Overlaps between the ORFs predicted from SNPs (by OrfPP) and from RPFs in (D) fission yeast, (E) Arabidopsis and (F) rice, according to which the annotated ORFs were categorized into three groups. Comparison of gene translation levels between the three groups in (G) fission yeast, (H) Arabidopsis and (I) rice.
Figure 4
Prediction of sORFs in the genomes of Arabidopsis and rice. Examples of sORFs predicted from SNP datasets of (A) Arabidopsis and (B) rice. The values from the first, second and third positions in each triplet were colored in cyan, orange and purple, respectively. Overlaps between the sORFs predicted from SNPs and from RPFs in (C) Arabidopsis and (D) rice. Translation levels of the sORFs in different groups of (E) Arabidopsis and (F) rice.
Figure 5
Application of OrfPP in complex genomes. Examples of identified ORFs from (A) cotton and (B) wheat. Performance of OrfPP in ORF identification from (C) cotton and (D) wheat SNPs. Novel ORFs identified from SNPs of (E) cotton and (F) wheat.
Figure 6
MS support for novel ORFs. Comparison of MS support between (A) all the novel ORFs, (B) uORFs and (C) dORFs and annotated ORFs identified from RPFs (circles) or SNPs (triangles) in different genomes.
Figure 7
ORF predictions from SNPs are independent of population size. Accessions were randomly sampled from the total SNP datasets of (A) Arabidopsis or (B) rice to generate SNP subsets with population sizes ranging from 100 to 1000. The sampling and ORF predictions were repeated five times.
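The subsampling design in the Figure 7 caption can be sketched as follows, again on synthetic data. Everything here is hypothetical: the haploid 0/1 genotype matrix, the per-site allele frequencies, the intermediate population sizes, and the `phase_contrast` summary are illustrative stand-ins, since the caption does not specify how the authors scored each subset beyond rerunning ORF prediction.

```python
import numpy as np

def phase_contrast(genotypes):
    """Mean diversity at triplet position 3 divided by the mean at positions 1 and 2."""
    n = genotypes.shape[0]
    p = genotypes.mean(axis=0)                  # alternate-allele frequency per site
    pi = n / (n - 1) * 2.0 * p * (1.0 - p)      # biallelic per-site diversity
    by_phase = pi.reshape(-1, 3).mean(axis=0)
    return by_phase[2] / by_phase[:2].mean()

# Synthetic population: 1000 accessions x 600 coding sites (assumption:
# third triplet positions carry more variation than the other two).
rng = np.random.default_rng(1)
n_acc, n_sites = 1000, 600
site_freq = np.full(n_sites, 0.02)
site_freq[2::3] = 0.10
genotypes = (rng.random((n_acc, n_sites)) < site_freq).astype(np.int8)

results = {}
for size in (100, 250, 500, 1000):              # population sizes from 100 to 1000
    reps = []
    for _ in range(5):                          # five repeats, as in the caption
        rows = rng.choice(n_acc, size=size, replace=False)
        reps.append(phase_contrast(genotypes[rows]))
    results[size] = float(np.mean(reps))

for size, ratio in results.items():
    print(f"{size:>5} accessions: phase contrast {ratio:.2f}")
```

In this toy setting the contrast between triplet positions stays similar across subset sizes, which is the qualitative behavior the caption reports for ORF prediction. It does not reproduce the authors' analysis.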
