Does rapid sequence divergence preclude RNA structure conservation in vertebrates?

Stefan E Seemann^{1,2}, Aashiq H Mirza^{1,3}, Claus H Bang-Berthelsen^{1,4}, Christian Garde¹, Mikkel Christensen-Dalsgaard¹, Christopher T Workman^{1,5}, Flemming Pociot^{1,3}, Niels Tommerup^{1,6}, Jan Gorodkin^{1,2}, Walter L Ruzzo^{1,7,8}

Affiliations

¹ Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark.
² Department of Veterinary and Animal Sciences, University of Copenhagen, Denmark.
³ Steno Diabetes Center Copenhagen, Gentofte, Denmark.
⁴ National Food Institute, Technical University of Denmark, Kgs. Lyngby, Denmark.
⁵ Center for Biological Sequence Analysis, Technical University of Denmark, Denmark.
⁶ Department of Cellular and Molecular Medicine (ICMM), University of Copenhagen, Denmark.
⁷ Computer Science and Engineering and Genome Sciences, University of Washington, USA.
⁸ Fred Hutchinson Cancer Research Center, Seattle, USA.

PMID: 35188540
PMCID: PMC8934657
DOI: 10.1093/nar/gkac067

Nucleic Acids Res. 2022 Mar 21;50(5):2452-2463.

Abstract

Accelerated evolution of any portion of the genome is of significant interest, potentially signaling positive selection of phenotypic traits and adaptation. Accelerated evolution remains understudied for structured RNAs, despite the fact that an RNA's structure is often key to its function. RNA structures are typically characterized by compensatory (structure-preserving) basepair changes that are unexpected given the underlying sequence variation, i.e., they have evolved through negative selection on structure. We address the question of how fast the primary sequence of an RNA can change through evolution while conserving its structure. Specifically, we consider predicted and known structures in vertebrate genomes. After careful control of false discovery rates, we obtain 13 de novo structures (and three known Rfam structures) that we predict to have rapidly evolving sequences, defined as structures where the primary sequences of human and mouse have diverged at least twice as fast (1.5 times for Rfam) as nearby neutrally evolving sequences. Two of the three known structures function in translation inhibition related to infection and immune response. We conclude that rapid sequence divergence does not preclude RNA structure conservation in vertebrates, although these events are relatively rare.

Figures

**Figure 1.**
Description of local neutral model and selection ratio’s correlation to covariates. (A) The local neutral model is defined by neutrally evolved ancestral repeats (AR; blue boxes) that are *local* to a feature (e.g. conserved RNA structure *CRS*; green box). Local (dark blue boxes) are the first 1000 positions of concatenated ARs around a feature. The pairwise sequence distance (d) is calculated between human and mouse for both the local neutral model of a CRS (d_LN(*CRS*)) and the CRS itself (d_F(*CRS*)). The type of selection of features is estimated by the selection ratio (SR). (B) Distribution of the sequence distance along human chromosome 1 in 100 kb windows for both CRSs (d_F(*CRS*)) and their corresponding local null (d_LN(*CRS*)) illustrates the linkage of the mutation rates on large scales. Gray vertical line indicates the centromere position. (C) Scatterplot of sequence distance of CRSs (d_F(*CRS*)) and sequence distance of corresponding local null (d_LN(*CRS*)). Points above the blue line are CRSs under negative selection and points below the red line are CRSs with rapidly evolving sequence based on our threshold definition (SR < 0.5 and SR > 2 respectively). (D) The distribution of selection ratio for CRSs (SR(*CRS*)) and individual ARs (SR(AR)). For *SR(AR)* distribution, we show one of the 10 independent samplings of ARs from the *FDR(SR)* calculation. As expected *SR(AR)* is distributed around one (note the color of *SR(AR)* is the same as for neutral selection in panel A). Dashed vertical lines mark our thresholds for structures under negative selection (SR < 0.5) and structure with rapidly evolving sequence (SR > 2). (E–G) Correlation of covariates of conserved structures to *SR(CRS)* is shown as 2d density estimation and linear regression (only SR lower than 2). (E, F) are measured from the 17 species structure alignments. The Spearman’s correlation coefficients are (E) ρ = −0.76, (F) ρ = 0.13 and (G) ρ = −0.08.
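As a minimal sketch of how the selection ratio in panel A could be computed, the snippet below assumes that SR is the ratio of the feature's human-mouse distance to the distance of its local neutral model, SR = d_F / d_LN. That reading is consistent with the abstract's "at least twice as fast" wording, but the caption does not spell out the formula. The function names and any input distances are hypothetical; the thresholds 0.5 and 2 are those marked in panels C and D.

```python
def selection_ratio(d_feature: float, d_local_neutral: float) -> float:
    """Assumed definition: SR = d_F / d_LN (not stated explicitly in the caption)."""
    if d_local_neutral <= 0:
        raise ValueError("local neutral distance must be positive")
    return d_feature / d_local_neutral


def classify(sr: float, low: float = 0.5, high: float = 2.0) -> str:
    """Apply the caption's thresholds: SR < 0.5 and SR > 2."""
    if sr < low:
        return "negative selection"
    if sr > high:
        return "rapidly evolving sequence"
    return "other"
```

For instance, a structure whose sequence distance is three times that of its local neutral model would be labeled as having rapidly evolving sequence, while a value exactly at a threshold falls into the "other" class because both inequalities are strict.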

**Figure 2.**
Estimation of the false discovery rate of the selection ratio, *FDR(SR)*, for *de novo* structures. Structures and sampled ancestral repeats (null model of neutral selection) were divided into ranges of two covariates: ‘de-gapped’ human-rhesus macaque-mouse alignment length [bp] and human G+C content. All pairwise combinations of length and G+C content ranges were applied for *FDR(SR)* estimation. To show the impact of each covariate on *FDR(SR)*, the two covariates are plotted separately. (A) Ranges of alignment length [bp]: (0–100], (100–150], (150–200], (200–300], (300–500]. (B) Ranges of human G+C content: [0–0.25], (0.25–0.30], (0.30–0.35], (0.35–0.40], (0.40–0.45], (0.45–0.50], (0.50–0.55], (0.55–0.60], (0.60–1.00]. A generalized additive model (GAM) with restricted maximum likelihood (REML) parameter estimation is fitted to the data in each covariate range. As our focus is on CRSs with rapidly evolving sequence, only CRSs with SR > 2 are shown as points (621 CRSs with SR > 4.25 are not shown). *FDR(SR)* was estimated inside different ranges of SR: 41 half-open intervals of width 0.1 from (0.0–0.1] to (3.9–4.0]. One of the 10 independent samplings of ARs is shown. Supplementary Figure S11 shows the combined plot of both covariates.
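The stratification by the two covariates can be sketched as follows. This is only an illustration of assigning a structure (or a sampled ancestral repeat) to its length and G+C range using the half-open intervals listed in the caption; it does not reproduce the authors' *FDR(SR)* estimation itself. The names `LENGTH_EDGES`, `GC_EDGES`, `bin_index` and `stratum` are hypothetical.

```python
from bisect import bisect_left

# Interval edges taken from the caption (panels A and B).
LENGTH_EDGES = [0, 100, 150, 200, 300, 500]
GC_EDGES = [0.0, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 1.00]


def bin_index(value, edges, include_lowest=False):
    """Return i such that edges[i] < value <= edges[i + 1], or None if outside.

    With include_lowest=True the first interval is closed on the left,
    matching the [0-0.25] G+C range in the caption.
    """
    if include_lowest and value == edges[0]:
        return 0
    if value <= edges[0] or value > edges[-1]:
        return None
    return bisect_left(edges, value) - 1


def stratum(length_bp, gc_content):
    """Return the (length bin, G+C bin) pair, or None if either covariate is out of range."""
    length_bin = bin_index(length_bp, LENGTH_EDGES)
    gc_bin = bin_index(gc_content, GC_EDGES, include_lowest=True)
    if length_bin is None or gc_bin is None:
        return None
    return (length_bin, gc_bin)
```

Under these assumptions, a 101 bp alignment with a G+C content of 0.36 would fall into the (100–150] length range and the (0.35–0.40] G+C range, and structures and null samples sharing a stratum would then be compared within each SR interval.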

**Figure 3.**
Examples of *de novo* structures with rapidly evolving sequence. Conservation patterns indicated in RNA secondary structures are based on 100-species structure-based alignments after removing alignment columns with of gaps and sequences with of gaps (drawing by R2R (37)). (A) *M1716264* overlaps the long ncRNA lnc-CLEC18B-44 (hg38/chr16:73609195–73609593) and has the following properties: SR = 2.9, *FDR(SR)* = 0.15, GC(human) = 0.36, SI(17 species) = 60.2%, Length(17 species) = 542 bp, SCI(17 species) = 0.13. (B) *M0770120* overlaps the 3’-UTR of mRNA TIPARP (hg38/chr3:156705568–156705927) and has the following properties: SR = 2.9, *FDR(SR)* = 0.15, GC(human) = 0.36, SI(17 species) = 65.4%, Length(17 species) = 387 bp, SCI(17 species) = 0.12. (C) *M0367414* is intronic of the long ncRNA LINC00871 (hg38/chr14:45954247–45954488) and has the following properties: SR = 3.3, *FDR(SR)* = 0.09, GC(human) = 0.23, SI(17 species) = 51.9%, Length(17 species) = 271 bp, SCI(17 species) = 0.16. (D) *M2048567* overlaps the processed pseudogene AC108673.1 (hg38/chr3:129046313–129046527) and has the following properties: SR = 3.6, *FDR(SR)* = 0.20, GC(human) = 0.71, SI(17 species) = 64.0%, Length(17 species) = 257 bp, SCI(17 species) = 0.30. The fitted RNA motif HL_35442.1 (36) contains a conserved trans-oriented Sugar-Edge Watson–Crick basepair with both isosteric basepairs G–A and A–A occurring in the alignment, and was only found in 2% of randomly selected structures (Supplementary Methods S3).

**Figure 4.**
Signals of structure conservation in *de novo* and known secondary structures. (A) The structure conservation index (SCI) measures the consistency between the structures of the individual sequences and the consensus structure in terms of minimum free energy (MFE). (B) Fraction of covarying basepairs in the annotated consensus structure. (C) Alignment power is the fraction of basepairs expected to show a significant covariation signal as calculated by R-scape. (D) Fraction of basepairs that show a significant covariation signal in the two-set statistical test (one test for annotated basepairs (bp), another for all other pairs) by R-scape (E < 0.05). We distinguish *de novo* structures with rapidly evolving sequence (*rapid CRS*: SR > 2 and *FDR(SR)* ≤ 0.2), under negative selection (*neg CRS*: SR < 0.5 and *FDR(SR)* ≤ 0.2), and other (*other CRS*). For comparison, Rfam (version 14.0) seed alignments (*Rfam*), their subset of vertebrate sequences (*Rfam vert*), and CMfinder predicted structure-based alignments of the human sequences in Rfam seed alignments and their homologous sequences extracted from the human (hg38) centered 100-way vertebrate MULTIZ alignments (*Rfam CMf*) were analyzed. The SCI in (A) has also been calculated for human (hg18) centered 17-way vertebrate UCSC Genome Browser alignments (MULTIZ) overlapping the human sequence of CRSs, and the human (hg38) centered 100-way MULTIZ overlapping the human sequences in Rfam seed alignments, illustrating the improved structure conservation signal in the structure-based alignments of CRSs. In (A) and (B) all 2791 Rfam seed alignments and 831 vertebrate alignments are shown, whereas in (C) and (D) R-scape analyzed only 1966 seed alignments and 712 vertebrate alignments, because for the others (including *mir-657*) the covariation in the alignment is too small (mostly due to too few sequences). The Rfam families *IRES Hsp70* (RF00495), *IFNγ* (RF00259) and *mir-657* (RF00988) with rapidly evolving sequence are indicated. Where a family is not indicated, its value is zero, e.g. R-scape estimates expected and observed significantly covarying basepairs to be zero in the *Rfam* and *Rfam vert* alignments for all three families. *mir-657* has significantly covarying basepairs (5 out of 9 bp) and, hence, is out of the y-axis limits in (D). The median values are marked as horizontal lines. All three Rfam families with rapidly evolving sequence have exclusively vertebrate sequences in their seed alignments, hence *Rfam* and *Rfam vert* values are the same for them: *IRES Hsp70* – 12 sequences from primates and 2 from cattle (see Supplementary Figure S8), *IFNγ* – 4 from primates and 1 from cattle, and *mir-657* – 2 from primates.
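The SCI of panel A is commonly defined as the consensus MFE of the alignment divided by the mean MFE of the individual sequences; the caption describes it only in words, so that formula is an assumption here. The sketch below takes precomputed energies as input rather than calling a folding program, and the function name is hypothetical.

```python
from statistics import mean


def structure_conservation_index(consensus_mfe, single_mfes):
    """Assumed definition: SCI = consensus MFE / mean MFE of the individual sequences.

    Energies are in kcal/mol and are expected to be zero or negative.
    """
    if not single_mfes:
        raise ValueError("need at least one single-sequence MFE")
    mean_single = mean(single_mfes)
    if mean_single == 0:
        # No individual sequence folds stably; report no conservation signal.
        return 0.0
    return consensus_mfe / mean_single
```

With this definition, values near zero indicate that the sequences do not share a common fold, and values near one indicate that the consensus structure is about as stable as the individual structures.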
