Mol Biol Evol. 2024 Jul 3;41(7):msae117.
doi: 10.1093/molbev/msae117.

COATi: Statistical Pairwise Alignment of Protein-Coding Sequences

Juan José García Mesa et al. Mol Biol Evol.

Abstract

Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequencing artifacts and errors made during genome assembly, such as abiological frameshifts and incorrect early stop codons, can impact downstream analyses leading to erroneous conclusions in comparative and functional genomic studies. More significantly, while indels can occur both within and between codons in natural sequences, most amino-acid- and codon-based aligners assume that indels only occur between codons. This mismatch between biology and alignment algorithms produces suboptimal alignments and errors in downstream analyses. To address these issues, we present COATi, a statistical, codon-aware pairwise aligner that supports complex insertion-deletion models and can handle artifacts present in genomic data. COATi allows users to reduce the amount of discarded data while generating more accurate sequence alignments. COATi can infer indels both within and between codons, leading to improved sequence alignments. We applied COATi to a dataset containing orthologous protein-coding sequences from humans and gorillas and conclude that 41% of indels occurred between codons, agreeing with previous work in other species. We also applied COATi to semiempirical benchmark alignments and find that it outperforms several popular alignment programs on several measures of alignment quality and accuracy.

Keywords: coding sequences; codon models; indel phases; pairwise alignment; statistical alignment.

Conflict of interest statement

None declared.

Figures

Fig. 1.
Standard algorithms produce suboptimal alignments. a) The true alignment of an ancestral sequence (A) and a descendant sequence (D). b) to d) The results of different aligners. Nucleotide mismatches are highlighted in red. Notably, COATi is the only aligner able to retrieve the biological alignment in this example. Indels in protein-coding sequences can be classified as having one of three different phases and being one of two different types. Phases refer to the location of the gap with respect to the reading frame, while types refer to the consequence of the indel. Phase-1, phase-2, and phase-3 indels are shown in blue, orange, and green, respectively. Additionally, the orange indel is type-II (an amino-acid indel plus an amino-acid change) while the blue indel is type-I (an amino-acid indel only). The difference between an in-frame and a frameshift indel is not displayed.
Fig. 2.
FSTs model the generation of an output sequence based on an input sequence. a) A graph of a probabilistic FST (Cotterell et al. 2014) for base-calling errors using a Mealy-machine architecture, where parameter u is the error rate. This graph contains two states (S and M) connected by arcs, with labels “input symbols : output symbols/weight.” Arcs consume symbols from the input sequence and emit symbols to the output sequence. Weights describe the probability that an arc is taken given the input symbols. Epsilon (ε) is a special symbol denoting that no symbols were either consumed or emitted. b) An FST for matching sequences against ambiguous nucleotides (N). c) An FST that results from the composition (∘ operation) of the Error FST with the Ambiguity FST.
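
As a rough illustration of how arc weights and composition behave, the Python sketch below treats the error FST as a per-base channel and composes it with an ambiguity step. It assumes that a miscall spreads the error rate u evenly over the three other nucleotides (u/3 each) and that the ambiguity FST accepts N against any base with weight 1; neither detail is stated in the caption, and all function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def error_weight(inp: str, out: str, u: float) -> float:
    """Weight of an error arc inp:out.
    ASSUMPTION: a miscall is equally likely to yield any of the 3 other bases."""
    return 1.0 - u if inp == out else u / 3.0

def ambiguity_weight(base: str, observed: str) -> float:
    """Weight of an ambiguity arc base:observed.
    ASSUMPTION: an observed N matches any base with weight 1."""
    return 1.0 if observed in (base, "N") else 0.0

def composed_weight(inp: str, observed: str, u: float) -> float:
    """Composition: sum over the intermediate base emitted by the error step."""
    return sum(
        error_weight(inp, mid, u) * ambiguity_weight(mid, observed)
        for mid in NUCLEOTIDES
    )

def sequence_weight(inp_seq: str, obs_seq: str, u: float) -> float:
    """Weight of an observed sequence given an input sequence of equal length."""
    if len(inp_seq) != len(obs_seq):
        raise ValueError("this sketch has no indel arcs; lengths must match")
    return math.prod(composed_weight(a, b, u) for a, b in zip(inp_seq, obs_seq))
```

Under these assumptions an observed N contributes a weight of 1 regardless of the input base, so ambiguous positions neither reward nor penalize a path through the composed machine.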
Fig. 3.
The COATi FST is built from simpler FSTs via composition. a) The substitution FST encodes a 61×61 codon substitution model with 3721 arcs from S to M. These arcs consume three nucleotides from the input tape and emit three nucleotides to the output tape. The weight of each arc is a conditional probability derived from a codon substitution model. See Fig. 2 for more details about reading this graph. b) The indel FST allows for insertions (H to I) and deletions (C to D). Here g is the gap-opening parameter and e is the gap-extension parameter. Insertion arcs are weighted according to the codon model’s stationary distribution of nucleotides, and deletion arcs have a weight of 1. This FST is structured such that if insertions and deletions are contiguous, insertions will precede deletions (cf. Holmes and Bruno 2001; De Maio 2021). c) The COATi FST is derived via composition from the codon substitution, indel, error, and ambiguity FSTs.
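
The arc count in panel a) follows from simple enumeration. The sketch below assumes the standard genetic code, in which TAA, TAG, and TGA are stop codons, and lists the ordered codon pairs that would each need one S-to-M arc. It does not compute arc weights, and the function names are hypothetical.

```python
from itertools import product

# ASSUMPTION: standard genetic code; these three codons are stops.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def sense_codons() -> list[str]:
    """All 64 nucleotide triplets minus the three stop codons."""
    triplets = ("".join(t) for t in product("ACGT", repeat=3))
    return [c for c in triplets if c not in STOP_CODONS]

def substitution_arcs() -> list[tuple[str, str]]:
    """One (input codon, output codon) pair per arc from S to M."""
    codons = sense_codons()
    return [(a, b) for a in codons for b in codons]

print(len(sense_codons()), len(substitution_arcs()))  # prints "61 3721"
```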
Fig. 4.
COATi’s alignments produce biologically reasonable evolutionary distances. a) The distribution of K2P distances inferred from alignments generated by each method. The averages of each distribution are indicated by vertical lines. The averages are COATi=0.0084, ClustalΩ=0.0178, MACSE=0.0149, MAFFT=0.0154, PRANK=0.0092, and COATi-rev=0.0084. b) The distribution of the differences between distances inferred by COATi and other methods. The x-axes of both plots have been pseudo-log transformed using the inverse hyperbolic sine.
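
For readers unfamiliar with the quantities plotted here, the sketch below computes a Kimura two-parameter (K2P) distance from a pairwise alignment using the standard formula d = -½ ln[(1 - 2P - Q)·√(1 - 2Q)], where P and Q are the proportions of transitions and transversions. Skipping columns that contain gaps or ambiguous characters is an assumption of this sketch, as is the unscaled use of `math.asinh` for the axis transform; the paper's exact handling and scaling are not given in the caption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq_a: str, seq_b: str) -> float:
    """K2P distance between two aligned sequences of equal length.
    ASSUMPTION: columns with gaps or non-ACGT symbols are ignored."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log / math.sqrt raise ValueError if the pair is saturated.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def pseudo_log(x: float) -> float:
    """Inverse hyperbolic sine: near-linear around 0, log-like for large |x|.
    ASSUMPTION: no extra scaling factor is applied."""
    return math.asinh(x)
```

The inverse hyperbolic sine is convenient for panel b) because, unlike a plain logarithm, it is defined for zero and negative differences.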
Fig. 5.
Aligners varied in which sequence pairs they identified as undergoing positive selection. In this UpSet plot, the bottom panel displays the 16 most frequent intersection patterns among aligners. A black circle represents positive selection. The most frequent pattern was that no aligner found positive selection while the second most frequent pattern was that all aligners found positive selection. Other patterns involved a disagreement between aligners about whether a sequence pair showed evidence of positive or negative selection. The top panel displays the number of sequence pairs in each grouping.
Fig. 6.
COATi’s alignments were closer to the semiempirical benchmark dataset than other methods according to a PCoA of the average alignment distance (dseq) between alignments generated by different methods.

References

    1. Abascal F, Zardoya R, Telford MJ. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 2010:38(suppl_2):W7–W13. doi: 10.1093/nar/gkq291.
    2. Allauzen C, Riley M, Schalkwyk J, Skut W, Mohri M. OpenFst: a general and efficient weighted finite-state transducer library. In: Holub J, Žďárek J, editors. Implementation and application of automata. Berlin, Heidelberg: Springer; 2007. p. 11–23. doi: 10.1007/978-3-540-76336-9_3.
    3. Arvestad L. Aligning coding DNA in the presence of frame-shift errors. In: Apostolico A, Hein J, editors. Combinatorial pattern matching. Berlin, Heidelberg: Springer; 1997. p. 180–190. doi: 10.1007/3-540-63220-4_59.
    4. Bininda-Emonds O. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics. 2005:6(1):1–6. doi: 10.1186/1471-2105-6-156.
    5. Blackburne BP, Whelan S. Measuring the distance between multiple sequence alignments. Bioinformatics. 2011:28(4):495–502. doi: 10.1093/bioinformatics/btr701.
