J Mol Graph Model. 2019 Nov:92:180-191.

doi: 10.1016/j.jmgm.2019.07.014. Epub 2019 Jul 26.

Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling

Dylan J T Mead¹, Simón Lunagomez², Derek Gatherer³

Affiliations

¹ Division of Biomedical & Life Sciences, Faculty of Health & Medicine, Lancaster University, Lancaster, LA1 4YT, UK. Electronic address: dylanmead.dm@googlemail.com.
² Department of Mathematics & Statistics, Lancaster University, Lancaster, LA1 4YF, UK. Electronic address: s.lunagomez@lancaster.ac.uk.
³ Division of Biomedical & Life Sciences, Faculty of Health & Medicine, Lancaster University, Lancaster, LA1 4YT, UK. Electronic address: d.gatherer@lancaster.ac.uk.

PMID: 31377535
PMCID: PMC7110651
DOI: 10.1016/j.jmgm.2019.07.014

Dylan J T Mead et al. J Mol Graph Model. 2019 Nov.

Abstract

The protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. Comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. Homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. Reconstructed ancestral RdRP sequences for individual genera were also used as templates for the production of ancestral RdRP homology models. High quality ancestral RdRP models were consistently produced, as were good quality models for target-template pairs in the same genus. Homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. We present a protocol for the production of optimal RdRP homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds.
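The abstract does not spell out the centrality measures used to pick CNNs, so the following is only a minimal sketch of the general idea, assuming a symmetric pairwise distance matrix and taking a CNN to be the sequence whose distances to all other sequences have the smallest mean (or median). The function name, the sequence labels and the toy distances are all hypothetical and are not taken from the paper.

```python
from statistics import mean, median


def centroid_nearest_neighbours(names, dist, centre=mean):
    """Return the name(s) whose distances to all others have the smallest centre value.

    `centre` is any function that summarises a list of distances (assumed
    here to be the mean or the median). Ties are all returned.
    """
    scores = {}
    for i, name in enumerate(names):
        others = [dist[i][j] for j in range(len(names)) if j != i]
        scores[name] = centre(others)
    best = min(scores.values())
    return [name for name, score in scores.items() if abs(score - best) < 1e-12]


# Toy symmetric distance matrix for four hypothetical sequences.
names = ["seqA", "seqB", "seqC", "seqD"]
dist = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.1, 0.8],
    [0.3, 0.1, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]

mean_cnn = centroid_nearest_neighbours(names, dist, mean)      # ['seqC']
median_cnn = centroid_nearest_neighbours(names, dist, median)  # ['seqB']
```

In this toy case the two summaries pick different sequences, which illustrates why more than one type of CNN can be worth reporting for a genus.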

Keywords: Force-directed graphs; Fruchterman-Reingold algorithm; Homology modelling; Multidimensional scaling; Protein structure; RNA-Dependent RNA polymerase; Reverse transcriptase; Sequence space; Sequence-structure gap.

Figures

**Fig. 1**
Force-directed graph visualisations of similarity of RdRPs (or reverse transcriptase for *Lentivirus*) within genera. The genetic distance matrix for each alignment was converted into a similarity matrix (Equations 1 and 2). The Fruchterman-Reingold algorithm (500 minimisation iterations) was run with the R package *qgraph* to produce a force-directed graph. Relative similarity is represented by node proximity, and absolute similarity is proportional to edge thickness. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table. *Cardiovirus* has fewer than four reference sequences and is omitted. A: Location of solved structure and the three CNNs in sequence space (Equations 3 to 7). Some genera have two median CNNs.
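To make the layout step concrete, here is a minimal pure-Python sketch of a Fruchterman-Reingold style layout. It is not the *qgraph* implementation. The distance-to-similarity conversion (`1 - d / max(d)`), the use of similarity as a multiplier on the attractive force, the linear cooling schedule and all function names are assumptions made for illustration; the paper's Equations 1 and 2 are not reproduced in this record.

```python
import math
import random


def distance_to_similarity(dist):
    """Assumed conversion (not the paper's equations): 1 - d / max(d)."""
    top = max(max(row) for row in dist)
    return [[1.0 - d / top for d in row] for row in dist]


def fruchterman_reingold(sim, iterations=500, seed=0):
    """Lay out nodes in 2-D; sim[i][j] in [0, 1] scales the attraction on edge i-j."""
    n = len(sim)
    rng = random.Random(seed)
    pos = [[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)] for _ in range(n)]
    k = math.sqrt(1.0 / n)  # ideal node spacing for a unit-area frame
    temp = 0.1              # largest allowed move, cooled linearly to zero
    for step in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = max(math.hypot(dx, dy), 1e-9)
                # Every pair repels (k^2 / d); similar pairs attract (w * d^2 / k).
                force = k * k / d - sim[i][j] * d * d / k
                fx = dx / d * force
                fy = dy / d * force
                disp[i][0] += fx
                disp[i][1] += fy
                disp[j][0] -= fx
                disp[j][1] -= fy
        limit = temp * (1.0 - step / iterations)
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            scale = min(length, limit) / length
            pos[i][0] += disp[i][0] * scale
            pos[i][1] += disp[i][1] * scale
    return pos


# Toy distances: three close sequences (0, 1, 2) and one outlier (3).
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 0.8],
    [0.2, 0.1, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
layout = fruchterman_reingold(distance_to_similarity(dist))
```

With these toy values the three similar sequences should settle close together and the outlier further away, which is the "relative similarity as node proximity" reading described in the caption.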

**Fig. 2**
Visualization of sequence space in two and three dimensions for *Orthohantavirus*. Multi-dimensional scaling on the *Orthohantavirus* similarity matrix was carried out with the R function *cmdscale* and viewed in Spotfire Analyst. Inset: the *Orthohantavirus* Fruchterman-Reingold representation from Fig. 1. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table.

**Fig. 3**
Visualization of sequence space in two and three dimensions for *Mammarenavirus*. Multi-dimensional scaling on the *Mammarenavirus* similarity matrix was carried out with the R function *cmdscale* and viewed in Spotfire Analyst. Inset: the *Mammarenavirus* Fruchterman-Reingold representation from Fig. 1. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table.
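For readers unfamiliar with the multi-dimensional scaling used in Figs. 2 and 3, the sketch below shows classical (metric) MDS in pure Python. It is an illustration only, not a port of *cmdscale*: it takes a distance matrix as input (how the similarity matrix was supplied in the paper is not stated in the captions), finds the leading eigenvectors by power iteration with deflation, and assumes the distances are close to Euclidean so that the leading eigenvalues are positive. The function name and the toy points are hypothetical.

```python
import math


def classical_mds(dist, dims=2, sweeps=500):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J the centring matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Deterministic, normalised start vector for power iteration.
        vec = [math.sin(i + 1.0 + axis) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        eigval = 0.0
        for _ in range(sweeps):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            eigval = math.sqrt(sum(x * x for x in nxt))
            if eigval < 1e-12:
                break
            vec = [x / eigval for x in nxt]
        if eigval < 1e-12:
            break  # no further non-trivial dimensions
        for i in range(n):
            coords[i][axis] = vec[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords


# Toy check: distances between the corners of a 3 x 4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
recovered = [[math.dist(p, q) for q in coords] for p in coords]
```

The recovered coordinates differ from the originals by a rotation, reflection or shift, but the pairwise distances should be preserved, which is the property that makes the two- and three-dimensional views comparable with the force-directed layout.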

**Fig. 4**
Homology models, Ramachandran (Φ-Ψ) plots and QMEAN Z-scores graphics for the “best” and “worst” intra-genus model. A: Superposition of *Rotavirus I* model (orange) on *Rotavirus A* template 2R7O (pink). B: Superposition of *American bat vesiculovirus* model (orange) on *Indiana vesiculovirus* template 5A22 (pink). C: Ramachandran (Φ-Ψ) plot for *Rotavirus I* model. D: Ramachandran (Φ-Ψ) plot for *American bat vesiculovirus* model. E: QMEAN Z-scores graphic for *Rotavirus I* model. F: QMEAN Z-scores graphic for *American bat vesiculovirus* model. The Φ-Ψ plots (C,D) show Ψ on the y-axis and Φ on the x-axis. Bond angle quality is classed as favoured, allowed, or outlier (cross, text). The Z-score graphics show model quality on a sliding scale from low quality to high quality. QMEAN4 shows the overall Z-score, “All Atom” shows the average Z-score for all of the atoms in the model, “CBeta” the Z-score for all Cβ carbons, “Solvation” is a measure of how accessible the residues are to solvents, and “Torsion” is a measure of torsion angle for each residue compared to adjacent residues.

**Fig. 5**
Flowchart of recommended strategy for choice of RdRP for docking experiments. Where a solved RdRP structure exists in a genus, it should be used. However, if that solved structure is not a CNN, a homology model of a CNN or ancestral sequence should be produced for comparative purposes. Where no solved RdRP structure exists in a genus, a structure from another genus in the same family may be used.
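The decision logic in this caption can be sketched as a small function. This is a reading of the caption text only, not of the figure itself; the function name, argument names and returned strings are hypothetical, and the final fallback (no solved structure in the genus or the family) is an assumption, since the caption does not cover that case.

```python
def choose_rdrp_structures(solved_in_genus, solved_is_cnn, solved_in_family):
    """Return a list of suggested actions following the Fig. 5 caption.

    solved_in_genus:  a solved RdRP structure exists in the genus
    solved_is_cnn:    that solved structure is itself a CNN
    solved_in_family: a solved structure exists in another genus of the family
    """
    if solved_in_genus:
        plan = ["use the solved structure from the genus"]
        if not solved_is_cnn:
            plan.append("also model a CNN or ancestral sequence for comparison")
        return plan
    if solved_in_family:
        return ["use a structure from another genus in the same family"]
    # Assumed fallback: the caption gives no recommendation for this case.
    return ["no recommendation from the caption"]


plan = choose_rdrp_structures(solved_in_genus=True, solved_is_cnn=False,
                              solved_in_family=True)
```

Here `plan` holds two steps, because the solved structure exists but is not a CNN.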
