BMC Bioinformatics. 2005 Jun 28;6:160.
doi: 10.1186/1471-2105-6-160.

Multiple sequence alignments of partially coding nucleic acid sequences

Roman R Stocsits et al. BMC Bioinformatics. 2005.

Abstract

Background: High-quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Because of the redundancy of the genetic code, however, nucleic acid sequences exhibit much greater sequence heterogeneity than the proteins they encode. It is therefore desirable to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only part of the sequence of interest is translated. In other cases, overlapping reading frames may encode multiple alternative proteins, possibly with intervening non-coding parts. RNA virus genomes are prominent examples.

Results: The standard scoring scheme for nucleic acid alignments can be extended to simultaneously incorporate information on translation products in one or more reading frames. Here we present codaln, a multiple alignment tool that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments and allows arbitrary weighting of almost all scoring parameters. The resource requirements of codaln are comparable to those of standard tools such as ClustalW.
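The idea of a combined score can be sketched for a single pairwise alignment. This is a minimal illustration and not codaln's implementation: the function name `combined_score`, all parameter names and default values, the identity-based amino acid score (used here instead of a substitution matrix), and the assumption that both sequences are coding from ungapped position `frame` onward are hypothetical choices made for this example. Input is assumed to be uppercase DNA over A, C, G, T with `-` for gaps.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def combined_score(a, b, frame=0, na_match=1.0, na_mismatch=-1.0, gap=-2.0,
                   aa_match=2.0, aa_mismatch=-1.0, aa_weight=1.0):
    """Score two aligned DNA strings on the nucleotide and amino acid level.

    Returns (nucleotide score, amino acid score, weighted total).
    All scoring values are illustrative assumptions, not codaln defaults.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    na_score = 0.0
    aa_score = 0.0
    pos_a = pos_b = 0  # ungapped positions reached in each sequence

    for col, (x, y) in enumerate(zip(a, b)):
        if x == "-" or y == "-":
            na_score += gap
        else:
            na_score += na_match if x == y else na_mismatch
            # An amino acid term is added only where a codon starts in the
            # same phase in both sequences and the codon columns are gap-free.
            codon_a, codon_b = a[col:col + 3], b[col:col + 3]
            in_frame = (pos_a >= frame and pos_b >= frame
                        and (pos_a - frame) % 3 == 0
                        and (pos_b - frame) % 3 == 0)
            complete = (len(codon_a) == 3
                        and "-" not in codon_a and "-" not in codon_b)
            if in_frame and complete:
                same = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
                aa_score += aa_match if same else aa_mismatch
        if x != "-":
            pos_a += 1
        if y != "-":
            pos_b += 1

    return na_score, aa_score, na_score + aa_weight * aa_score


in_frame_pair = combined_score("ATGCATGCT", "ATGCAAGCC")
frameshift_pair = combined_score("ATGCATGCT", "ATGC-TGCT")
```

In the second call the single-nucleotide gap puts the two sequences out of phase, so in this sketch no amino acid term is added downstream of the indel. Raising or lowering `aa_weight` shifts the balance between the two levels of information.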

Conclusion: We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments, such as methods for detecting conserved RNA secondary structure elements.

Figures

Figure 1
Example of the higher sequence heterogeneity at the nucleic acid level. The hypothetical amino acid alignment at the top shows a high degree of similarity. The same sequences are shown below at the nucleic acid level, where sequence similarity is very low. The pairwise identity is only 33%, just slightly above the 25% identity expected for two random nucleic acid sequences.
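The effect described in this caption can be reproduced with a toy example. The sketch below is not taken from the figure: the sequences `seq1` and `seq2` and the helper names `translate` and `identity` are hypothetical and were chosen so that synonymous codons differ as much as possible. For uniform base composition, two independent random positions match with probability 4 × (1/4)² = 1/4, which is the 25% baseline mentioned above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an ungapped uppercase DNA string in frame 0."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: Leu-Ser-Arg encoded with different synonymous codons.
seq1 = "TTATCTCGT"
seq2 = "CTGAGCAGA"

protein_identity = identity(translate(seq1), translate(seq2))
nucleotide_identity = identity(seq1, seq2)
```

Both toy sequences translate to the same peptide, while only two of the nine nucleotide positions agree.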
Figure 2
Application of the scoring model to a hypothetical alignment. Note that there are no amino acid contributions in the right-hand part of the example because the single indel causes a frameshift. For illustration, we show BLOSUM62 scores and simple scores for nucleic acids and gaps rather than the rescaled default values (His/Gln has score 0).
Figure 3
Reports on the annotated and inferred structure of the input sequences are automatically generated by codaln, respecting all user intervention.
Figure 4
Relative distribution of gaps in an alignment of genomic Hox4 sequences. The alignment is essentially gap-less in exon 2. ClustalW (above) returns a very poor alignment of exon 1 in which gaps occur with a broad distribution. In contrast, codaln respects the coding region so that almost all gap lengths in this area are divisible by 3.
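A gap-length distribution of this kind can be tabulated with a few lines of code. The sketch below assumes that gaps are counted as maximal runs of `-` within each aligned sequence; how the figure itself counts gaps is not stated here, so this convention, the function names, and the toy alignment are assumptions made for illustration.

```python
import re
from collections import Counter


def gap_length_counts(aligned_seqs):
    """Count maximal runs of '-' by length over all aligned sequences."""
    counts = Counter()
    for seq in aligned_seqs:
        for run in re.findall(r"-+", seq):
            counts[len(run)] += 1
    return counts


def fraction_multiple_of_three(counts):
    """Fraction of gaps whose length is divisible by 3 (0.0 if no gaps)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    in_frame = sum(n for length, n in counts.items() if length % 3 == 0)
    return in_frame / total


# Hypothetical toy alignment with gaps of length 3, 1 and 6.
toy_alignment = ["ATG---CAT", "ATGCA-CAT", "A------AT"]
toy_counts = gap_length_counts(toy_alignment)
toy_fraction = fraction_multiple_of_three(toy_counts)
```

Restricting the input to the columns of a coding region would give a per-region value; a fraction close to 1 there indicates that the alignment preserves the reading frame.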
Figure 5
Hogeweg mountain plots of conserved RNA structures in Levivirus genomes. Above: ClustalW, below: codaln. Colors indicate the number of consistent mutations: red 1, ochre 2, green 3, turquoise 4, blue 5. Saturated colors indicate that all sequences are compatible with the predicted structure. Decreasing color saturation indicates 1 or 2 incompatible sequences. The thickness of the slabs is proportional to the average frequency of the base pair in thermodynamic equilibrium. For further details see [3].
Figure 6
The 5'-terminal hairpin in Levivirus (left) is probably the analog of the recognition signal for the RNA replicase in Alloleviviruses, which is well characterized in Qβ (right). In Qβ the replicase amplifies RNA templates autocatalytically with high efficiency. This recognition element in Levivirus likely has a similar function.

References

    1. Rivas E, Eddy SR. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2001;2:19.
    2. Hofacker IL, Fekete M, Flamm C, Huynen MA, Rauscher S, Stolorz PE, Stadler PF. Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl Acids Res. 1998;26:3825–3836. doi: 10.1093/nar/26.16.3825.
    3. Hofacker IL, Stadler PF. Automatic Detection of Conserved Base Pairing Patterns in RNA Virus Genomes. Comp & Chem. 1999;23:401–414. doi: 10.1016/S0097-8485(99)00013-3.
    4. Thurner C, Hofacker IL, Stadler PF. Conserved RNA Pseudoknots. In: Giegerich R, Stoye J, editors. Proceedings of the GCB 2004 (Bielefeld), Volume P-53 of GI-Edition: Lecture Notes in Informatics. 2004. pp. 207–216.
    5. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA. 2005;102:2454–2459. doi: 10.1073/pnas.0409169102.