Algorithms Mol Biol. 2010 Jan 4;5(1):6.
doi: 10.1186/1748-7188-5-6.

Back-translation for discovering distant protein homologies in the presence of frameshift mutations


Marta Girdea et al. Algorithms Mol Biol. 2010.

Abstract

Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.

Results: We developed a novel method for inferring distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We designed a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein. Its goal is to determine the two putative DNA sequences that have the best-scoring alignment under a scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/.

Conclusions: Our approach uncovers evolutionary information that traditional alignment methods do not capture, as confirmed by biologically significant examples.
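
The core idea in the abstract can be illustrated with a much simplified sketch. The Python below is not the authors' algorithm or scoring system: it assumes a plain match count as the score, allows no gaps, and fixes the reading frame offset in advance. The function names (`nodes_at`, `preds`, `best_ungapped`) and the DNA string are made up for illustration. It runs a dynamic program over pairs of codon prefixes, one pair per aligned nucleotide position, to find the pair of putative DNA sequences with the most matching nucleotides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
CODONS = {}  # amino acid -> list of codons encoding it
for codon, aa in TABLE.items():
    CODONS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate the complete codons of dna, starting at its first nucleotide."""
    return "".join(TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def nodes_at(protein, pos):
    """Codon prefixes that can end at nucleotide position pos of a back-translation."""
    k = pos % 3 + 1
    return {codon[:k] for codon in CODONS[protein[pos // 3]]}


def preds(protein, pos, node):
    """Prefixes at position pos - 1 that can precede node (assumes pos > 0)."""
    if len(node) > 1:
        return [node[:-1]]
    return CODONS[protein[pos // 3 - 1]]  # any full codon of the previous residue


def best_ungapped(p, q, shift):
    """Maximum number of matching nucleotides between a putative DNA of p and
    one of q, when q's DNA starts `shift` nucleotides into p's (no gaps).
    Assumes the two back-translations overlap by at least one nucleotide."""
    overlap = min(3 * len(p) - shift, 3 * len(q))
    score = {}
    for t in range(overlap):
        current = {}
        for a in nodes_at(p, t + shift):
            for b in nodes_at(q, t):
                prev = 0 if t == 0 else max(
                    score[pa, pb]
                    for pa in preds(p, t + shift, a)
                    for pb in preds(q, t, b))
                # The last letter of a prefix is the nucleotide at this position.
                current[a, b] = prev + (a[-1] == b[-1])
        score = current
    return max(score.values())


dna = "ATGGCTAAGCTTGACCGTA"  # made-up sequence for illustration
p, q = translate(dna), translate(dna[1:])  # reading frames 0 and +1
print(p, q, best_ungapped(p, q, shift=1))
```

Here `p` is `MAKLDR` and `q` is `WLSLTV`, which share a single identical position when compared residue by residue, yet all 17 overlapping nucleotides of their back-translations can be made to match at an offset of one. A full method would also have to locate the frameshift positions and score substitutions realistically, which this sketch does not attempt.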


Figures

Figure 1
Back-translation graph examples. A fully represented (a) and condensed (b) back-translation graph for the amino acid sequence YSH.
Figure 2
Obtaining a simple back-translation graph for the amino acid R. The construction of a simple back-translation graph for the amino acid R, which is encoded by 6 codons. Note that identical nucleotides are associated with different nodes if they have different prefixes in the codons where they appear.
Figure 3
Example of reverse complementary back-translation graphs for the amino acid sequence YSH. The reverse complement of a back-translation graph can be obtained in the classic manner, by reversing the arcs and complementing the nucleotide symbols that label the nodes.
Figure 4
Alignment example. A path (corresponding to a putative DNA sequence) was chosen from each graph so that the match/mismatch ratio is maximized.
Figure 5
Example of dynamic programming matrix M. M[i, j] is a "cell" of M corresponding to position i of the first graph and position j of the second graph. M[i, j] contains entries (α_i, β_j) corresponding to pairs of nodes occurring at position i in the first graph and at position j in the second graph, respectively.
Figure 6
Sequence divergence by frameshift mutation. Two proteins are encoded on the same DNA sequence, on different reading frames; at some point, the sequence was duplicated and the two copies diverged independently; we assume that the two coding sequences undergo, in their independent evolution, synonymous and non-synonymous point mutations, or full codon insertions and removals.
Figure 7
Yersinia pestis: transposases. The alignment of two transposase variants from Yersinia pestis: [GenBank:167423046] - subsequence 4-167 of the back-translation, and [GenBank:EDR63673.1] - subsequence 225-389 of the back-translation. The frameshift mutation at position 115/336 corrects the reading frame. The frameshifted alignment fragment has an E-value of 10^-7.
Figure 8
Xylella fastidiosa: glucosidases. Two β-glucosidase variants from Xylella fastidiosa: [GenBank:AAO29662.1] - subsequence 202-2645 of the back-translation, and [GenBank:EAO32640.1] - subsequence 2-2444 of the back-translation. Only a fragment of the full alignment (the first 239 base pairs) is shown. The second part is not particularly interesting in this context because the sequences are aligned on the same reading frame, with very few mismatches. In the first part, the sequences are aligned with a reading frame difference that is corrected starting at positions 304/104. The frameshifted alignment fragment has an E-value of 10^-8.
Figure 9
Elapidae: neurotoxins (1). Two presynaptic neurotoxins from two higher snakes of the Elapidae family (Bungarus candidus and Naja kaouthia): [Swiss-Prot:Q8AY47.1] - subsequence 64-407 of the back-translation, and [PIR:PSNJ2K] - subsequence 13-354 of the back-translation. The sequences are aligned on the same reading frame up to position 186/135, and on a +1 reading frame from that point forward. The frameshifted fragment has an E-value of 10^-9.
Figure 10
Elapidae: neurotoxins (2). Two Bungarus candidus proteins that are very similar at the DNA level ([Swiss-Prot:Q8AY47.1] and [Swiss-Prot:Q8AY48.1]). Of the first 94 amino acid pairs, only 4 are mismatches (which are transitions at the coding DNA level). A frameshift mutation is visible at position 284 of the back-translated sequences. The fragments that follow it align almost perfectly with a frameshift, with an E-value of 10^-9.
Figure 11
Elapidae: neurotoxins (3). Two presynaptic neurotoxins from two higher snakes of the Elapidae family ([DDBJ:BAA75760.1] of Laticauda colubrina and [DDBJ:BAC78208.1] of Laticauda laticaudata). The alignment shows that the unidentified peptide is in fact an alternative splicing (or frameshifted) variant of the neurotoxin. The frameshifted fragment has an E-value of 10^-10.
Figure 12
Platelet-derived growth factor proteins. The alignment of the back-translated platelet-derived growth factor proteins from Homo sapiens and Rattus sp. ([Swiss-Prot:P04085.1] and [DDBJ:BAA00987.1]). The two proteins share high similarity at the amino acid level on the subsequences 1-84 and 113-195. The amino acids 85-112 can be easily aligned with a frameshift, with an E-value of 10^-6. The "inducing" and "correcting" frameshifts are located on two different exons.
Figure 13
Classic protein alignment of the platelet-derived growth factor proteins. The classic protein alignment of the platelet-derived growth factor proteins from Homo sapiens and Rattus sp. ([Swiss-Prot:P04085.1] and [DDBJ:BAA00987.1]) shows very little amino acid similarity between the 85-112 subsequences, which we successfully aligned on a +1 frameshift.
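
The graph constructions described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is an illustrative reading of the captions, not the authors' implementation: it assumes that each node corresponds to one codon prefix and is labelled by the last nucleotide of that prefix, and it does not attempt the condensed representation mentioned for Figure 1. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}  # amino acid -> list of codons encoding it
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def back_translation_graph(protein):
    """Return (labels, arcs). A node is (residue index, codon prefix) and is
    labelled by the last nucleotide of its prefix, so identical nucleotides
    with different prefixes get different nodes."""
    labels, arcs, prev_ends = {}, set(), []
    for i, aa in enumerate(protein):
        for codon in CODONS[aa]:
            for k in (1, 2, 3):
                labels[i, codon[:k]] = codon[k - 1]
                if k > 1:
                    arcs.add(((i, codon[:k - 1]), (i, codon[:k])))
            # Every codon end of the previous residue leads into this codon.
            arcs.update((end, (i, codon[:1])) for end in prev_ends)
        prev_ends = [(i, codon) for codon in CODONS[aa]]
    return labels, arcs


def reverse_complement(labels, arcs):
    """Complement the node labels and reverse every arc."""
    return ({n: COMPLEMENT[b] for n, b in labels.items()},
            {(v, u) for u, v in arcs})


def spelled_sequences(labels, arcs):
    """All nucleotide strings read along source-to-sink paths."""
    succ = {n: [] for n in labels}
    has_pred = set()
    for u, v in arcs:
        succ[u].append(v)
        has_pred.add(v)

    def walk(node):
        if not succ[node]:
            yield labels[node]
        for nxt in succ[node]:
            for tail in walk(nxt):
                yield labels[node] + tail

    return {s for n in labels if n not in has_pred for s in walk(n)}


labels, arcs = back_translation_graph("R")
print(len(labels))  # 10 nodes: 2 + 2 + 6 codon prefixes
rc_labels, rc_arcs = reverse_complement(*back_translation_graph("YSH"))
print(len(spelled_sequences(rc_labels, rc_arcs)))  # 24 = 2 * 6 * 2 codon choices
```

Enumerating every path is exponential in the protein length and is used here only to check the construction on short inputs; the point of the graph representation is that an alignment algorithm can work on nodes and arcs without listing the sequences.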


