Nature. 2024 Sep;633(8030):695-703.
doi: 10.1038/s41586-024-07899-8. Epub 2024 Sep 4.

Mapping glycoprotein structure reveals Flaviviridae evolutionary history

Jonathon C O Mifsud et al. Nature. 2024 Sep.

Abstract

Viral glycoproteins drive membrane fusion in enveloped viruses and determine host range, tissue tropism and pathogenesis [1]. Despite their importance, there is a fragmentary understanding of glycoproteins within the Flaviviridae [2], a large virus family that includes pathogens such as hepatitis C, dengue and Zika viruses, and numerous other human, animal and emergent viruses. For many flaviviruses the glycoproteins have not yet been identified; for others, such as the hepaciviruses, the molecular mechanisms of membrane fusion remain uncharacterized [3]. Here we combine phylogenetic analyses with protein structure prediction to survey glycoproteins across the entire Flaviviridae. We find class II fusion systems, homologous to the Orthoflavivirus E glycoprotein, in most species, including highly divergent jingmenviruses and large genome flaviviruses. However, the E1E2 glycoproteins of the hepaciviruses, pegiviruses and pestiviruses are structurally distinct, may represent a novel class of fusion mechanism, and are strictly associated with infection of vertebrate hosts. By mapping glycoprotein distribution onto the underlying phylogeny, we reveal a complex evolutionary history marked by the capture of bacterial genes and potentially inter-genus recombination. These insights, made possible through protein structure prediction, refine our understanding of viral fusion mechanisms and reveal the events that have shaped the diverse virology and ecology of the Flaviviridae.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Generation of a protein foldome for the Flaviviridae.
a, RdRp phylogeny reveals three major lineages within the Flaviviridae: (1) Orthoflavivirus/jingmenvirus (including orthoflavivirus-like—for example, Tamana bat virus); (2) LGF/Pestivirus; and (3) Hepacivirus/Pegivirus. An unrooted tree is shown, with the tombusviruses (TOM) representing the outgroup taxa and a scale bar denoting the number of amino acid substitutions per site. Genome organization is provided for exemplar species, with annotations based on InterProScan searches. b, Crystal structure of DENV-2 NS3 (left) shown alongside a ColabFold predicted structure for the corresponding region of the polyprotein (right). These structures superpose with a root mean square deviation (r.m.s.d.) of 1.6 Å. The predicted structure is colour-coded by per residue confidence scores (predicted local distance difference test (pLDDT)), as indicated in the bar. c, Scatter plot of MSA depth and prediction confidence (pLDDT). d, MSA depths for each sequence block in each genus or subclade, colour-coded as in a (orthoflavivirus-like viruses are included with the orthoflaviviruses). The mean is shown as a solid black line (n = 4,754 (Orthoflavivirus), 704 (jingmenvirus), 4,358 (LGF), 1,278 (Pestivirus), 2,904 (Hepacivirus) and 1,623 (Pegivirus) sequence blocks). e, Scatter plots representing prediction confidence (pLDDT) for ColabFold and ESMFold for each sequence block in each genus or subclade. Numerical values provide the performance ratio between the protein structure prediction methods; values below 1 indicate better performance by ColabFold.
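The r.m.s.d. quoted in panel b summarises how far paired atoms lie from each other once two structures have been superposed. A minimal Python sketch of that calculation follows. It assumes the two coordinate lists are already superposed and paired atom-for-atom; the function name `rmsd` and the input layout are illustrative, and this is not a reconstruction of how the 1.6 Å value was obtained.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired coordinate lists.

    Assumption: both structures are already superposed, and coords_a[i]
    pairs with coords_b[i]. Each entry is an (x, y, z) tuple in angstroms.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("at least one atom pair is required")

    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2

    # Mean of the squared distances, then the square root.
    return math.sqrt(squared_sum / len(coords_a))
```

In practice the superposition step itself (for example a least-squares fit) would come first; this sketch covers only the final distance summary.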
Fig. 2. Discovery of glycoproteins across the Flaviviridae.
a, RdRp phylogeny rooted on the tombusviruses (removed for visualization), with each genus or subclade colour-coded as in Fig. 1a. b, Foldseek structure-based homology e-value heat maps for the stated references, colour-coded as shown in the key. In the case of E1, E2, E, prM and MTase the values represent summary e values after comparison with a range of relevant reference structures (Methods). c, Host species tropism for each virus. ‘Vectored’ refers to those assigned as ‘Yes’ or ‘Potentially’ in Supplementary Table 3. Vertical lines within the heat map demark divisions between major clades. d–g, Representative reference structures and Foldseek hits for E1 (d), E2 (e), E (f) and prM (g). For each hit, only the Foldseek-aligned residues are shown for any given structure; metrics provide e value, sequence identity, structural alignment score (local distance difference test (LDDT), ranging from 0 to 1) and protein structure prediction method. Predicted structures are colour-coded by pLDDT confidence scores, as shown in the key. In d,e, the reference structures are previously published ColabFold models. In f,g, experimental structures are used (PDB: 7QRF and 6ZQI, respectively). CSFV, classical swine fever virus; HCV, hepatitis C virus.
Fig. 3. Novel and acquired proteins in a large genome flavivirus.
a, N-terminal glycoproteins from BTV4. Linear representation of the BTV4 polyprotein displays the locations of putative glycosylation sites and the predicted signal peptidase cleavage sites that delineate five mature proteins (labelled A–E). For each of these mature proteins, the highest confidence models are shown from three prediction methods (ColabFold, ColabFold with custom MSAs and ESMFold). Protein B only yielded low-confidence models and is not shown. Each protein contains a putative transmembrane domain (pTMD), protein C contains a canonical furin cleavage site, and the conserved fusion loop (FL) of E is also annotated. b, LGF/Pestivirus lineage RdRp phylogeny and Foldseek e-value heat maps for the stated reference structures. Annotations provide the location of BTV4 (reference), the spider pestivirus-like viruses (Spider P-L), and the cartilaginous fish pestivirus-like viruses (CFish P-L). c, Example Foldseek hits against an experimental structure of bovine viral diarrhoea virus Erns ribonuclease (PDB: 4DVK). BVDV, bovine viral diarrhoea virus. d, Ribonuclease T2 (RNase T2) sequence phylogeny, with domains of life and viruses colour-coded as shown in the key. The scale bar indicates phylogenetic distance as number of substitutions per site. This protein has been independently acquired once by RNA viruses and twice by DNA viruses (in Mimiviridae and polydnaviruses). The RNA virus clade is nested within bacterial instances of the RNase T2, suggesting a single horizontal gene transfer event. e, Phylogeny of the RNA virus RNase T2/Erns clade rooted on non-viral sequences, with viral clades colour-coded as shown in the key. An uncollapsed version containing all tip labels is provided in Supplementary Fig. 8.
Fig. 4. Structurally informed phylogenetics.
a, Left, 3Di-based E structural phylogeny. The scale bar indicates the number of 3Di character substitutions per site (see Methods for details of tree selection). Right, representative structures superposed using flexible FATCAT with a ColabFold model of West Nile virus E protein as reference (green). Structures are colour-coded as in the phylogeny. The protein alignments provide structurally aligned consensus-level amino acid sequences for the fusion loop, domain III and transmembrane domain. Conserved residues are highlighted. b, Left, combined 3Di and amino acid-based E1 structural phylogeny. The scale bar indicates the number of 3Di and amino acid character substitutions per site. Right, representative structures are superposed with Hepacivirus F E1 protein. Alignments demonstrate consensus-level homology in the E1 helical hairpin and transmembrane domain. Structures are colour-coded as in the phylogeny. c, Left, combined 3Di and amino acid-based structural phylogeny of E2 protein. Right, representative structures are superposed with Hepacivirus F E2. Consensus-level homology in E2 back layer, stem and transmembrane domain are provided. The basal Wenling moray eel Hepacivirus is marked in both the E1 and E2 trees. AA, amino acid.
Fig. 5. Proposed evolutionary history of the Flaviviridae.
Illustrative cladogram showing the key protein acquisition and loss events across the major Flaviviridae clades. The two major lineages are labelled (Lin. 1 and Lin. 2), near the root. Each clade, displayed as a tip, is annotated with symbols representing the presence of key proteins. Branches are highlighted to denote the lineage-specific presence of envelope protein E (in light blue) or E1/E2 (in maroon). Major nodes are emphasized with larger symbols to infer the ancestral emergence of each protein within the Flaviviridae. Dashed lines and arrows denote the loss or gain of specific proteins, highlighting potential recombination events and gene transfers. Image of Aquifex pyrophilus by Guillaume Dera; CC BY 1.0 (https://creativecommons.org/licenses/by/1.0/).
Extended Data Fig. 1. Two-dimensional MDS plot of the NS5b phylogeny variations.
a, Coloured by alignment method. b, Excluding trees generated from Clustal Omega alignments. Points, which represent individual phylogenies, are colour-coded based on the clusters identified using the ‘findGroves’ function. Point shapes indicate the alignment software, the trimAl gap and consensus thresholds, and the substitution model applied. An arrow signifies the master phylogeny (Tree 18) chosen for further analysis.
Extended Data Fig. 2. Structure prediction performance.
a, For structural inference, all Flaviviridae polyprotein sequences were split into blocks of 300 residues, each overlapping by 100 residues (461 species, 16,463 blocks in total). Residue numbers are provided for the first three blocks. b, Representative ColabFold protein structure predictions spanning the entire Dengue Virus 2 (DENV-2) polyprotein. Residue numbers are provided as in a. Structures are colour-coded by prediction confidence scores (pLDDT), as denoted in the key. c, pLDDT confidence scores along the length of the DENV-2 polyprotein. Dotted lines delineate the mature proteins, which are labelled on the x-axis. d, Scatter plots representing predicted TM-score (pTM) confidence metric for ColabFold and ESMFold for each sequence block in each genus/subclade. Numerical values provide the performance ratio between the two protein structure prediction methods; values below 1 indicate better performance by ColabFold.
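The block scheme in panel a (300-residue windows that overlap by 100 residues) amounts to a sliding window with a step of 200. The Python sketch below illustrates this under stated assumptions: the function name `split_into_blocks` is hypothetical, and the handling of the final, shorter block is a guess, since the caption does not say how the polyprotein tail was treated.

```python
def split_into_blocks(sequence, block_size=300, overlap=100):
    """Split a sequence into overlapping blocks.

    Returns a list of (start, end, block) tuples with 1-based, inclusive
    residue numbers. Assumption: the last block is simply truncated at
    the end of the sequence.
    """
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be non-negative and smaller than block_size")
    if not sequence:
        return []

    step = block_size - overlap  # 200 residues with the defaults
    blocks = []
    start = 0
    while True:
        end = min(start + block_size, len(sequence))
        blocks.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return blocks


if __name__ == "__main__":
    # Toy 700-residue sequence; prints the residue ranges of each block.
    for first, last, _ in split_into_blocks("A" * 700):
        print(first, last)
```

With the defaults, the first three blocks of a sufficiently long sequence cover residues 1 to 300, 201 to 500 and 401 to 700.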
Extended Data Fig. 3. Benchmarking with sequence-based homology searches.
a, RdRp phylogeny, as in Fig. 2a. b, Heatmap comparison of homology detection using Foldseek (as in Fig. 2), DIAMOND and InterProScan for the stated reference proteins. DIAMOND results represent a pure sequence-based recapitulation of the Foldseek search (i.e., all query and reference structures used in Foldseek were represented by their cognate protein sequences for DIAMOND analysis). Foldseek and DIAMOND data are log e-values and are colour-coded as in the key. InterProScan results provide a simple binary score: match (red) or no match (white). Vertical lines demark divisions between major clades.
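Plotting e-values on a log scale, as in these heat maps, requires a decision about e-values reported as exactly zero, for which the logarithm is undefined. The Python sketch below shows one common approach. The base-10 logarithm and the `floor` value are assumptions made for illustration; the caption specifies neither.

```python
import math


def log_evalue(evalue, floor=1e-300):
    """Return log10 of an e-value, clamping very small values to `floor`.

    Assumptions: base-10 logarithm, and a hypothetical floor of 1e-300 so
    that an e-value of 0 maps to a finite number.
    """
    if evalue < 0:
        raise ValueError("e-values cannot be negative")
    return math.log10(max(evalue, floor))
```

More negative outputs correspond to stronger hits, so a colour scale built on these values should be read in that direction.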
Extended Data Fig. 4. ESMFold permits unambiguous detection of glycoproteins in divergent species.
a, RdRp phylogeny, as in Fig. 2a. b, E glycoprotein Foldseek e-value heatmaps for Flaviviridae structures predicted only by ColabFold (top), only by ESMFold (middle) or when combined (bottom). c, Representative examples of targets that are predicted well by ESMFold, but poorly by ColabFold. Structures are colour-coded by pLDDT confidence scores, as shown in the key. pLDDT and pTM metrics are provided for each model.
Extended Data Fig. 5. Analysis of environmental pesti-like viruses.
a, Large genome flavivirus/Pestivirus subset of the Flaviviridae phylogeny (Tree 18) collapsed to highlight the environmental pesti-like viruses for which no glycoproteins were identified. The scale bar denotes the number of amino acid substitutions per site. b, Genome organisation is provided for each species, with annotations based on conserved domain sequence searches. c, Foldseek e-value heatmaps for the indicated reference proteins; values are log transformed and colour-coded as shown in the key. For E, E1 and E2 the values represent summary e-values after comparison with a range of relevant reference structures, as described in the Methods.
Extended Data Fig. 6. Absence of a fusion loop in E glycoprotein homologues of the jingmenviruses.
Structurally conserved E protein fusion loops (FL) were found in orthoflavi-, LGF-, and in pesti-like viruses. The FL is absent from the E protein homologue of the jingmenviruses (its expected location is marked by an asterisk). Amino acid side chains are shown for the FL only.
Extended Data Fig. 7. Foldseek detection of methyltransferase in diverse viruses.
DENV-2 reference structure and Foldseek hits for MTase. For each hit, only the Foldseek-aligned residues are shown; metrics provide e-value, sequence identity (%), structural alignment score (LDDT, ranging from 0 to 1), and protein structure prediction method. Predicted structures are colour-coded by pLDDT confidence scores, as shown in the key.
Extended Data Fig. 8. Conservation of structure revealed through the Foldseek 3Di alphabet.
Representative structures of E (West Nile virus) or E1 and E2 (Hepacivirus F) colour-coded by either sequence or structural conservation, as denoted in the key. In both cases values represent percentage conservation of the consensus for each structurally aligned protein (see Methods for details), with amino acid residues representing protein sequence and the 3Di structural alphabet representing protein structure.
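Percentage conservation of the consensus can be computed column by column from an alignment, and the same routine applies whether each row is an amino acid string or a 3Di string. The Python sketch below is illustrative only. The function name `consensus_conservation` is hypothetical, and the treatment of gaps (excluded from the consensus but counted in the denominator) is an assumption rather than the procedure described in the Methods.

```python
from collections import Counter


def consensus_conservation(alignment, gap="-"):
    """Per-column consensus character and its percentage conservation.

    `alignment` is a list of equal-length strings (amino acid or 3Di).
    Assumption: gaps never form the consensus, but gapped rows still
    count towards the denominator.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*alignment):
        counts = Counter(char for char in column if char != gap)
        if not counts:
            # Column made up entirely of gaps.
            result.append((gap, 0.0))
            continue
        char, count = counts.most_common(1)[0]
        result.append((char, 100.0 * count / len(column)))
    return result
```

Running the function once on the amino acid alignment and once on the matching 3Di alignment gives two per-position profiles that can be compared directly, which is the kind of contrast the colour-coding in this figure conveys.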
Extended Data Fig. 9. Structurally aligned E glycoprotein phylogenies.
Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for either the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.
Extended Data Fig. 10. Structurally aligned E1 glycoprotein phylogenies.
Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for either the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.
Extended Data Fig. 11. Structurally aligned E2 glycoprotein phylogenies.
Protein sequences were aligned using their 3Di structural representation (see Methods for details). Phylogenies were reconstructed using 3Di sequence alone (top), amino acid (AA) sequence alone (middle) or combined 3Di and AA sequence (bottom). Right-hand trees are derived from alignments trimmed with a gap threshold of 35%. Scale bars indicate substitutions per site for either the 3Di, AA or combined sequences, respectively. Tip shapes are colour-coded by genus/subclade as in Fig. 1a. All phylogenetic trees are provided in the associated Zenodo repository.

References

    1. Grove, J. & Marsh, M. The cell biology of receptor-mediated virus entry. J. Cell Biol. 195, 1071–1082 (2011).
    2. Simmonds, P. et al. ICTV virus taxonomy profile: Flaviviridae. J. Gen. Virol. 98, 2–3 (2017).
    3. Rey, F. A. & Lok, S.-M. Common features of enveloped viruses and implications for immunogen design for next-generation vaccines. Cell 172, 1319–1334 (2018).
    4. Hubálek, Z. & Halouzka, J. West Nile fever—a reemerging mosquito-borne viral disease in Europe. Emerg. Infect. Dis. 5, 643–650 (1999).
    5. Wang, Z.-D. et al. A new segmented virus associated with human febrile illness in China. N. Engl. J. Med. 380, 2116–2125 (2019).
