Bioinformatics. 2022 Jun 24;38(Suppl 1):i255-i263.
doi: 10.1093/bioinformatics/btac247.

On the reliability and the limits of inference of amino acid sequence alignments

Sandun Rajapaksa et al. Bioinformatics.

Abstract

Motivation: Alignments are correspondences between sequences. How reliable are alignments of amino acid sequences of proteins, and what inferences about protein relationships can be drawn from them? Using techniques not previously applied to these questions, we weight every possible sequence alignment by its posterior probability to derive a formal mathematical expectation, and we develop an efficient algorithm for computing the distance between alternative alignments. Together these allow quantitative comparison of sequence-based alignments with the corresponding reference structure alignments.
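As a hedged reading of the expectation described above, written in notation that is assumed here rather than taken from the paper: let S and T be the two sequences, \mathcal{A}(S,T) the set of all their alignments, A_ref the reference alignment, and d a distance between two alignments. A posterior-weighted expected distance would then take the form:

```latex
\mathbb{E}\left[\, d(A, A_{\mathrm{ref}}) \mid S, T \,\right]
  = \sum_{A \in \mathcal{A}(S,T)} \Pr(A \mid S, T)\; d(A, A_{\mathrm{ref}})
```

The sum ranges over a set that grows exponentially with sequence length, which is presumably why an efficient algorithm is needed rather than direct enumeration.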

Results: By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the 'daylight', 'twilight' and 'midnight' zones for interpreting residue-residue correspondences from sequence information alone.
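To make "Markov time" concrete, here is a minimal sketch that assumes the usual convention for time-parameterized substitution models: a one-step stochastic matrix P is raised to the power t, and the expected fraction of changed residues is one minus the stationary-weighted diagonal of P^t. The 3-letter alphabet, the matrix `TOY_P` and the function names are hypothetical; real models such as BLOSUM, VTML or MMLSUM use 20 amino acids and their own matrices.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, t):
    """Raise square matrix p to the non-negative integer power t."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = p
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result


def expected_change(p, stationary, t):
    """Expected fraction of positions that differ after t Markov steps."""
    pt = mat_pow(p, t)
    return 1.0 - sum(stationary[i] * pt[i][i] for i in range(len(stationary)))


# Hypothetical toy model: 3 states, 2% chance of change per step.
TOY_P = [[0.98, 0.01, 0.01],
         [0.01, 0.98, 0.01],
         [0.01, 0.01, 0.98]]
TOY_PI = [1 / 3, 1 / 3, 1 / 3]  # stationary distribution of TOY_P

if __name__ == "__main__":
    for t in (1, 50, 250, 500):
        print(t, round(100 * expected_change(TOY_P, TOY_PI, t), 2), "% changed")
```

The expected change grows with t and saturates, so equal increments in Markov time correspond to progressively smaller increments in observed sequence difference.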

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
An illustration of the distance between two alignments measured as the area between two source-to-sink paths
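The area idea in this caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: an alignment is encoded as a hypothetical string of operations, where `M` is a diagonal step (a residue pair), `D` a step along the first sequence only, and `I` a step along the second sequence only. Each path crosses every column of the grid exactly once, with either a flat or a diagonal segment, so the area between two paths is the sum of exact trapezoid areas over the columns. Any normalization the paper may apply is not reproduced.

```python
def column_profile(ops):
    """Return [(y_start, y_end)] for each column the path crosses, plus final y."""
    y = 0
    profile = []
    for op in ops:
        if op == "M":        # diagonal: advance in both sequences
            profile.append((y, y + 1))
            y += 1
        elif op == "D":      # advance in the first sequence only
            profile.append((y, y))
        elif op == "I":      # advance in the second sequence only
            y += 1
        else:
            raise ValueError(f"unknown alignment operation: {op!r}")
    return profile, y


def area_between(ops_a, ops_b):
    """Area enclosed between two source-to-sink alignment paths."""
    prof_a, end_a = column_profile(ops_a)
    prof_b, end_b = column_profile(ops_b)
    if len(prof_a) != len(prof_b) or end_a != end_b:
        raise ValueError("paths do not align the same pair of sequence lengths")
    # Within a column the two segments never strictly cross, so the
    # trapezoid formula on the absolute gaps gives the exact area.
    return sum((abs(s1 - s2) + abs(e1 - e2)) / 2
               for (s1, e1), (s2, e2) in zip(prof_a, prof_b))


if __name__ == "__main__":
    print(area_between("MM", "DIM"))  # a match versus a gap pair, then a shared match
```

Identical paths give an area of zero, and the area grows as the two alignments disagree over longer stretches or by larger offsets.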
Fig. 2.
First, second and third quartile statistics of the expected distance of sequence alignments with respect to a reference alignment obtained using the MMLigner program (Collier et al., 2017), as a function of the inferred Markov time parameter, time_marginal. The three columns correspond to the three time-parameterized models employed during the MML sequence comparison: (a) BLOSUM, (b) VTML and (c) MMLSUM (Sumanaweera et al., 2020). In each plot, the x-axis (shown below the box) is the Markov time step in the range [1, 500]; the range on the top of the plot is the corresponding expected %-change of amino acids for the chosen time-parameterized model; the y-axis (left of the box) is the expected distance; the scatter plots (magenta, green and blue) track the changes in the first, second and third quartiles of the expected distance over 1 million domain pairs, grouped according to their inferred integer Markov time step (time_marginal); the vertical range on the right of the box tracks the cumulative %-growth of the number of domain pairs as a function of time on the x-axis.
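The grouping described in this caption can be sketched with the standard library. The input format (pairs of integer Markov time step and expected distance) and the function name are assumptions made for illustration, and the choice of the "inclusive" quantile method is arbitrary; the paper does not state which quartile convention it uses.

```python
from collections import defaultdict
from statistics import quantiles


def quartiles_by_time(records):
    """Map each integer Markov time step to (Q1, Q2, Q3) of its distances.

    records: iterable of (time_step, expected_distance) pairs.
    """
    groups = defaultdict(list)
    for time_step, distance in records:
        groups[time_step].append(distance)

    summary = {}
    for time_step in sorted(groups):
        values = groups[time_step]
        if len(values) == 1:
            # A single observation has no spread; report it for all three.
            q1 = q2 = q3 = values[0]
        else:
            q1, q2, q3 = quantiles(values, n=4, method="inclusive")
        summary[time_step] = (q1, q2, q3)
    return summary


if __name__ == "__main__":
    toy = [(1, 0.5), (1, 1.5), (1, 1.0), (2, 4.0), (2, 6.0)]  # made-up values
    for step, qs in quartiles_by_time(toy).items():
        print(step, qs)
```

Plotting the three values per time step against the time axis would give three scatter series of the kind the caption describes.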
Fig. 3.
Expected distance versus Markov time plots, after separating the million domain pairs into distinct groups based on their compression statistics (see Table 1). (a) The variation of the expected distance of sequence alignments on the subset of domain pairs where the optimal alignment model beats the null (Δoptimal > 0). (b) Same as above, but for the subset of domain pairs for which the marginal beats the null but the optimal does not (Δoptimal ≤ 0 < Δmarginal). (c) As above, but for the remaining domain pairs, where neither the optimal nor the marginal beats the null.
Fig. 4.
Visualization of the marginal probability matrices (landscapes) generated by comparing the amino acid sequence of human hemoglobin (1HHO chain A) with the sequences of six other homologous globins. The color codes denote the negative logarithm of the product of the marginal probabilities that the prefixes and the suffixes are related (see Section 2.2). Within each matrix, the colors span the [min, max] range of that matrix's values.
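A landscape of this kind can be sketched with a forward and a backward pass over the alignment grid. The scoring model below is a deliberately crude stand-in, not the MML model of the paper: the three per-step probabilities (`p_same`, `p_diff`, `p_gap`) and the function name are hypothetical and chosen only so the example runs. The forward table sums over all alignments of the prefixes, the backward table over all alignments of the suffixes, and each cell reports the negative logarithm of their product.

```python
import math


def marginal_landscape(s, t, p_same=0.6, p_diff=0.1, p_gap=0.1):
    """Return (landscape, total) for sequences s and t under a toy model.

    landscape[i][j] = -log(forward[i][j] * backward[i][j]), where forward
    covers prefixes s[:i], t[:j] and backward covers suffixes s[i:], t[j:].
    """
    m, n = len(s), len(t)

    def sub(a, b):
        return p_same if a == b else p_diff

    # Forward pass: sum over all alignments of the two prefixes.
    fwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    fwd[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * sub(s[i - 1], t[j - 1])
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * p_gap
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * p_gap

    # Backward pass: sum over all alignments of the two suffixes.
    bwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    bwd[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i < m and j < n:
                bwd[i][j] += bwd[i + 1][j + 1] * sub(s[i], t[j])
            if i < m:
                bwd[i][j] += bwd[i + 1][j] * p_gap
            if j < n:
                bwd[i][j] += bwd[i][j + 1] * p_gap

    landscape = [[-math.log(fwd[i][j] * bwd[i][j]) for j in range(n + 1)]
                 for i in range(m + 1)]
    return landscape, fwd[m][n]


if __name__ == "__main__":
    grid, total = marginal_landscape("HEAGAW", "PAWHEAE")  # arbitrary toy strings
    for row in grid:
        print(" ".join(f"{v:6.2f}" for v in row))
```

Cells with low values are those through which most of the probability mass of all alignments passes, so a narrow valley suggests a well-determined alignment and a broad basin an ambiguous one.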

References

    1. Allison L. (2018) Coding Ockham’s Razor. Springer, Cham, Switzerland.
    2. Barton G.J., Sternberg M.J. (1987) Evaluation and improvements in the automatic alignment of protein sequences. Protein Eng., 1, 89–94.
    3. Blake J.D., Cohen F.E. (2001) Pairwise sequence alignment below the twilight zone. J. Mol. Biol., 307, 721–735.
    4. Bujnicki J.M. (2003) Crystallographic and bioinformatic studies on restriction endonucleases: inference of evolutionary relationships in the “midnight zone” of homology. Curr. Protein Pept. Sci., 4, 327–337.
    5. Cheng H. et al. (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol., 10, e1003926.