PLoS Comput Biol. 2018 Nov 1;14(11):e1006547.
doi: 10.1371/journal.pcbi.1006547. eCollection 2018 Nov.

Motif-Aware PRALINE: Improving the alignment of motif regions

Maurits Dijkstra et al. PLoS Comput Biol.

Abstract

Protein or DNA motifs are sequence regions of biological importance. These regions are often highly conserved among homologous sequences. Generating multiple sequence alignments (MSAs) in which the conserved sequence motifs are correctly aligned remains difficult, because the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm to address this shortcoming. The method incorporates explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments, we confirm that a good compromise can be found that improves the matching of motif regions without significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments, we find that there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and to show how the motif-aware alignment method can be employed to alleviate them.
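
To make the role of α concrete, the following is a minimal Python sketch of a motif-boosted substitution score. It rests on assumptions that the abstract does not spell out: the boost is taken to be a flat bonus α added to the ordinary substitution score only when both aligned positions are annotated as motif matches. The function name boosted_score and the table BLOSUM62_EXCERPT (three BLOSUM62 entries, not the full matrix) are hypothetical and serve only as an illustration.

```python
# Small excerpt of BLOSUM62, for illustration only (not the full matrix).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("A", "G"): 0,
    ("G", "A"): 0,
    ("G", "G"): 6,
}


def boosted_score(a, b, a_in_motif, b_in_motif, alpha):
    """Score for aligning residues a and b.

    Assumption: alpha is added only when BOTH positions are annotated
    as motif matches; otherwise the plain substitution score is used.
    """
    base = BLOSUM62_EXCERPT[(a, b)]
    if a_in_motif and b_in_motif:
        return base + alpha
    return base


if __name__ == "__main__":
    # Two motif-annotated alanines with alpha = 20.
    print(boosted_score("A", "A", True, True, 20))
    # Same residues, but only one of them lies inside a motif match.
    print(boosted_score("A", "A", True, False, 20))
```

With α = 0 this sketch reduces to the ordinary substitution score, so larger values of α trade overall alignment score for motif agreement.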

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. The problem of aligning motifs in variable regions: The HIV envelope protein (ENV) sequence [5].
The bottom strip shows an overview of the first half of the alignment. At the top, one variable region (V1, left) and one constant region (C3, right) are shown. N-linked glycosylation is crucial for the immune evasion function of ENV; hence N-{P}-[ST]-{P} motifs are prevalent throughout. In C3 (right), the motifs are clearly well aligned; in the variable V1 (left), however, the motifs are poorly aligned even though the conserved flanking regions provide convenient anchors for the alignment. This alignment was created with Clustal Omega [6]. For clarity, only a subset of representative sequences (one per patient) is shown here. Figures were created using Jalview [18]; N-linked glycosylation motifs are shown in yellow.
Fig 2. Example showing the influence of α on an alignment.
Positions shaded in gray show amino acids which are part of a motif pattern match (the pattern A-A in this example). The path taken through the dynamic programming matrix is shown shaded in red. BLOSUM62 was used to score amino acid substitutions, together with a gap open penalty of −11 and a gap extension penalty of −1 (or, g(l) = −11 + (−1 * (l − 1)) for l > 0 and g(l) = 0 for l = 0).
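
The affine gap penalty quoted in this caption can be written out as a short Python function. This is a sketch of the formula g(l) only; the function name gap_penalty and its keyword arguments are hypothetical, and the defaults are the gap open (−11) and gap extension (−1) values stated in the caption.

```python
def gap_penalty(length, gap_open=-11, gap_extend=-1):
    """Affine gap penalty g(l) as given in the caption.

    g(l) = gap_open + gap_extend * (l - 1) for l > 0, and g(0) = 0.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for l in range(4):
        print(l, gap_penalty(l))
```

Under these defaults, opening a gap costs 11 and each additional gapped position costs 1 more, so one long gap is cheaper than several short ones of the same total length.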
Fig 3. An explanation of the motif matching rules and the motif annotation process.
A: N-linked glycosylation site motif pattern, with an explanation of its match rules. Informative match rules are shown in bold. B: Example sequence (top row) matched against the N-linked glycosylation site pattern (middle row). The bottom row shows whether a position is annotated as a motif match (M) or not (*).
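
The annotation process described in this caption can be sketched in Python as follows. The sketch translates a simple PROSITE-style pattern into a regular expression and produces an M/* track for a sequence. Several points are assumptions rather than details taken from the source: the helper names prosite_to_regex and annotate are hypothetical, only single residues, [..] sets, {..} exclusions and x wildcards are handled (no repetition counts or anchors), overlapping matches are all reported, and every position covered by a match is marked M regardless of whether its match rule is informative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern such as N-{P}-[ST]-{P}.

    Supported rules: single residue, [..] allowed set, {..} excluded set,
    and x (any residue). Repetitions like x(2) are NOT handled here.
    """
    parts = []
    for rule in pattern.split("-"):
        if rule == "x":
            parts.append(".")
        elif rule.startswith("{") and rule.endswith("}"):
            parts.append("[^" + rule[1:-1] + "]")
        else:
            parts.append(rule)  # single residue or [..] set
    return "".join(parts)


def annotate(sequence, pattern):
    """Return a track with 'M' at positions covered by a match, '*' elsewhere."""
    # A lookahead with a capture group lets finditer report overlapping matches.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    track = ["*"] * len(sequence)
    for match in regex.finditer(sequence):
        for i in range(match.start(1), match.end(1)):
            track[i] = "M"
    return "".join(track)


if __name__ == "__main__":
    seq = "AANGSAA"  # made-up example sequence
    print(seq)
    print(annotate(seq, "N-{P}-[ST]-{P}"))
```

A track of this kind is the sort of per-position annotation that a motif-aware scoring step could consult alongside the residues themselves.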
Fig 4. The estimate of the motif match score α*, derived from the HOMSTRAD set of reference alignments.
Every point corresponds to a single motif pattern found in PROSITE that has at least one match in HOMSTRAD. qm is the fraction of amino acids matching a pattern versus the total number of amino acids in all HOMSTRAD alignments containing at least one instance of the pattern. Short patterns are defined as having fewer than 6 match rules, average length patterns as having between 6 and 20, and long patterns as having more than 20. The dashed line indicates the α* limit for perfect motif conservation; the four points that most prominently fail to obey the limit correspond to PROSITE entries PS00022, PS00142, PS00370 and PS00589.
Fig 5. Incremental improvements for the difficult variable region V1 in the HIV ENV protein.
Only the V1 region is shown in this figure; in cases where the full protein sequence was aligned, V1 was extracted afterwards using an alignment editor. Alignments were made using: A: regular PRALINE, aligning the entire protein; B: MA-PRALINE (α = 20), aligning the entire protein; C: regular PRALINE, aligning just V1; D: MA-PRALINE (α = 20), aligning just V1. Figures were created using Jalview [18]; N-linked glycosylation motifs are shown in yellow.
Fig 6. Alignment programs compared on the motif-rich region of the BAliBASE 3 reference alignment set of cupredoxin proteins (BB30015).
Yellow colored residues match the copper binding motif (with PROSITE identifier PS00196). Lighter colors indicate non-informative residues within a motif match. Alongside the program output a reference structure is shown (PDB identifier 1AAC) to visualize the motif within the structural context of the protein family. Note that the proline residue at position 94 in the shown structure is misaligned in the BAliBASE reference. The colors used here are the same as in the alignment outputs. The ligand from the PDB structure is shown in red.
Fig 7. Alignment programs compared on the motif-rich region of the BAliBASE 3 reference alignment set of nitrate reductase enzymes (BB20035).
Residues colored yellow match against the eukaryotic molybdopterin oxidoreductase motif (PROSITE identifier PS00559). Residues colored green match against the cytochrome b5 heme-binding motif (PROSITE identifier PS00191). Lighter colors indicate non-informative residues within a motif match. Alongside the program output a reference structure is shown for each motif matching region (PDB identifiers: 2BIH for the oxidoreductase motif; 4B8N for the heme-binding motif) to visualize the motif within the structural context of the protein family. The colors used here are the same as in the alignment outputs. The ligand from the PDB structure is shown in red.

References

    1. Bork P, Gibson TJ. Applying motif and profile searches. Methods in enzymology. 1996;266:162. doi:10.1016/S0076-6879(96)66013-3
    2. Bork P, Koonin EV. Protein sequence motifs. Current opinion in structural biology. 1996;6(3):366–376. doi:10.1016/S0959-440X(96)80057-1
    3. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences. 1992;89(22):10915–10919. doi:10.1073/pnas.89.22.10915
    4. Dayhoff M, Schwartz R, Orcutt B. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Springs, MD, suppl. 1978;5:345–352.
    5. van den Kerkhof T, Feenstra KA, Euler Z, Van Gils MJ, Rijsdijk L, Boeser-Nunnink BD, et al. HIV-1 envelope glycoprotein signatures that correlate with the development of cross-reactive neutralizing activity. Retrovirology. 2013;10(1):102. doi:10.1186/1742-4690-10-102