J Mol Graph Model. 2019 Nov;92:180-191.
doi: 10.1016/j.jmgm.2019.07.014. Epub 2019 Jul 26.

Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling

Dylan J T Mead et al. J Mol Graph Model. 2019 Nov.

Abstract

The protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing and slow, expensive experimental structure determination techniques. Comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. Homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality was assessed using several metrics. Reconstructed ancestral RdRP sequences for individual genera were also used as templates for the production of ancestral RdRP homology models. High quality ancestral RdRP models were consistently produced, as were good quality models for target-template pairs in the same genus. Homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. We present a protocol for the production of optimal RdRP homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds.

Keywords: Force-directed graphs; Fruchterman-Reingold algorithm; Homology modelling; Multidimensional scaling; Protein structure; RNA-Dependent RNA polymerase; Reverse transcriptase; Sequence space; Sequence-structure gap.

Figures

Graphical abstract
Fig. 1
Force-directed graph visualisations of similarity of RdRPs (or reverse transcriptase for Lentivirus) within genera. The genetic distance matrix for each alignment was converted into a similarity matrix (Equations (1) and (2)). The Fruchterman-Reingold algorithm (500 minimisation iterations), as implemented in the R package qgraph, was used to produce a force-directed graph. Relative similarity is represented by node proximity, and absolute similarity is proportional to edge thickness. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table. Cardiovirus has fewer than four reference sequences and is omitted. A: Location of the solved structure and the three CNNs in sequence space (Equations (3) to (7)). Some genera have two median CNNs.
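The pipeline in this caption (distance matrix, then similarity matrix, then force-directed layout) can be sketched in plain Python. This is a minimal illustration only: Equations (1) and (2) are not reproduced in this record, so the conversion `s = 1 - d / max(d)` is an assumption, and the layout is a simplified weighted Fruchterman-Reingold loop rather than the qgraph implementation. The function names and the toy distances are hypothetical.

```python
import math
import random

def to_similarity(dist):
    """Assumed conversion (not the paper's Equations 1-2): s = 1 - d / max(d)."""
    d_max = max(max(row) for row in dist)
    n = len(dist)
    return [[0.0 if i == j else 1.0 - dist[i][j] / d_max for j in range(n)]
            for i in range(n)]

def fruchterman_reingold(sim, iterations=500, seed=0):
    """Simplified weighted Fruchterman-Reingold layout; returns [x, y] per node."""
    rng = random.Random(seed)
    n = len(sim)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    k = math.sqrt(1.0 / n)   # ideal edge length for a unit-area canvas
    temp = 0.1               # maximum step per iteration, cooled linearly
    for step in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = max(math.hypot(dx, dy), 1e-9)
                repulse = k * k / d              # every pair repels
                attract = sim[i][j] * d * d / k  # similar pairs attract
                force = (repulse - attract) / d
                disp[i][0] += dx * force
                disp[i][1] += dy * force
                disp[j][0] -= dx * force
                disp[j][1] -= dy * force
        t = temp * (1.0 - step / iterations)
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            scale = min(length, t) / length      # cap the move at temperature t
            pos[i][0] += disp[i][0] * scale
            pos[i][1] += disp[i][1] * scale
    return pos

# Hypothetical toy distances: sequences 0 and 1 are close, as are 2 and 3.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
toy_sim = to_similarity(toy_dist)
layout = fruchterman_reingold(toy_sim)
```

In the resulting layout, nodes with high similarity settle close together, which is the "relative similarity is represented by node proximity" property the caption describes. Edge thickness is a drawing attribute and is not modelled here.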
Fig. 2
Visualization of sequence space in two and three dimensions for Orthohantavirus. Multi-dimensional scaling on the Orthohantavirus similarity matrix was performed with the R function cmdscale and viewed in Spotfire Analyst. Inset: the Orthohantavirus Fruchterman-Reingold representation from Fig. 1. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table.
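For readers unfamiliar with what cmdscale computes, the following is a rough Python sketch of classical (Torgerson) multidimensional scaling. It is not the R implementation: it takes a distance matrix as input (how the paper's similarity matrix was passed to cmdscale is not stated in this record), it uses power iteration with deflation instead of a full eigendecomposition, and it assumes the distances are close to Euclidean so that the leading eigenvalues are positive. The function name and the triangle example are hypothetical.

```python
import math

def classical_mds(dist, dims=2, power_iters=500):
    """Classical MDS via double centring and power iteration (illustrative)."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J the centring matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic, non-constant start vector.
        vec = [math.sin(i + d + 1.0) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(power_iters):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        eigval = sum(vec[i] * sum(b[i][j] * vec[j] for j in range(n))
                     for i in range(n))
        if eigval <= 1e-12:
            break  # no further positive dimensions to extract
        for i in range(n):
            coords[i][d] = vec[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords

# Hypothetical example: pairwise distances of a 3-4-5 right triangle.
triangle = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]
embedding = classical_mds(triangle, dims=2)
```

For exactly Euclidean input such as this triangle, the embedded points reproduce the original pairwise distances up to rotation and reflection. A three-dimensional view, as in the figure, corresponds to `dims=3`.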
Fig. 3
Visualization of sequence space in two and three dimensions for Mammarenavirus. Multi-dimensional scaling on the Mammarenavirus similarity matrix was performed with the R function cmdscale and viewed in Spotfire Analyst. Inset: the Mammarenavirus Fruchterman-Reingold representation from Fig. 1. The solved structure and the three types of centroid nearest neighbour (CNN) sequences are highlighted. The species names corresponding to the numbered nodes are listed in the Supplementary Table.
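The idea of a centroid nearest neighbour (CNN) highlighted in these figures can be illustrated with a short Python sketch. The paper's own definitions (Equations (3) to (7)) are not reproduced in this record, so the sketch below is an assumption: it scores each sequence by a summary (mean or median) of its distances to all other sequences and returns the sequence or sequences with the lowest score. The function name and the toy distance matrix are hypothetical.

```python
import math
from statistics import mean, median

def centroid_nearest_neighbours(dist, measure=mean):
    """Return indices of the most central sequences under `measure`.

    Assumed definition, not the paper's: centrality is the mean (or median)
    distance from a sequence to every other sequence; ties are all returned.
    """
    n = len(dist)
    scores = [measure([dist[i][j] for j in range(n) if j != i])
              for i in range(n)]
    best = min(scores)
    return [i for i, s in enumerate(scores) if math.isclose(s, best)]

# Hypothetical toy data: four sequences placed at 0, 1, 2 and 10 on a line.
positions = [0, 1, 2, 10]
toy_dist = [[abs(a - b) for b in positions] for a in positions]

mean_cnn = centroid_nearest_neighbours(toy_dist, measure=mean)
median_cnn = centroid_nearest_neighbours(toy_dist, measure=median)
```

With this toy matrix the mean-based score ties between sequences 1 and 2, while the median-based score selects sequence 1 alone. Returning a list rather than a single index keeps such ties visible instead of resolving them arbitrarily.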
Fig. 4
Homology models, Ramachandran (Φ-Ψ) plots and QMEAN Z-score graphics for the “best” and “worst” intra-genus models. A: Superposition of Rotavirus I model (orange) on Rotavirus A template 2R7O (pink). B: Superposition of American bat vesiculovirus model (orange) on Indiana vesiculovirus template 5A22 (pink). C: Ramachandran (Φ-Ψ) plot for Rotavirus I model. D: Ramachandran (Φ-Ψ) plot for American bat vesiculovirus model. E: QMEAN Z-score graphic for Rotavirus I model. F: QMEAN Z-score graphic for American bat vesiculovirus model. The Φ-Ψ plots (C, D) show Ψ on the y-axis and Φ on the x-axis. Bond angle quality is shown as favoured, allowed, or outlier. The Z-score graphics show model quality on a sliding scale from low quality to high quality. “QMEAN4” shows the overall Z-score, “All Atom” shows the average Z-score for all of the atoms in the model, “CBeta” the Z-score for all Cβ carbons, “Solvation” is a measure of how accessible the residues are to solvents, and “Torsion” is a measure of torsion angle for each residue compared to adjacent residues.
Fig. 5
Flowchart of recommended strategy for choice of RdRP for docking experiments. Where a solved RdRP structure exists in a genus, it should be used. However, if that solved structure is not a CNN, a homology model of a CNN or ancestral sequence should be produced for comparative purposes. Where no solved RdRP structure exists in a genus, a structure from another genus in the same family may be used.
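The decision logic in this caption can be written out as a small Python function. This is only a paraphrase of the caption, not the authors' flowchart itself: the function name, parameter names, and returned wording are hypothetical, and the final fallback (no solved structure in the family) is not covered by the caption, so it is marked as such using the abstract's remark on inter-family modelling.

```python
def choose_rdrp_strategy(solved_in_genus, solved_is_cnn=False,
                         solved_in_family=False):
    """Return the recommended steps for picking an RdRP for docking.

    Hypothetical encoding of the Fig. 5 caption; argument names are
    illustrative and not taken from the paper.
    """
    if solved_in_genus:
        steps = ["use the solved structure from the genus"]
        if not solved_is_cnn:
            # The caption asks for a comparison model in this case.
            steps.append("build a homology model of a CNN or ancestral "
                         "sequence for comparison")
        return steps
    if solved_in_family:
        return ["use a structure from another genus in the same family"]
    # Not part of the caption: the abstract reports inter-family
    # modelling as unreliable, so no recommendation is encoded here.
    return ["no recommendation (inter-family modelling reported as unreliable)"]

# Example: a genus with a solved structure that is not a CNN.
plan = choose_rdrp_strategy(solved_in_genus=True, solved_is_cnn=False)
```

Returning a list of steps rather than a single label reflects that the caption can recommend two actions at once: using the solved structure and producing a comparison model.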
