Plant Commun. 2025 May 12;6(5):101320.
doi: 10.1016/j.xplc.2025.101320. Epub 2025 Mar 24.

A molecular representation system with a common reference frame for analyzing triterpenoid structural diversity

Nicole Babineau et al. Plant Commun.

Abstract

Researchers have uncovered hundreds of thousands of natural products, many of which contribute to medicine, materials, and agriculture. However, missing knowledge about the biosynthetic pathways of these products hinders their expanded use. Nucleotide sequencing is key to pathway elucidation efforts, and analyses of the molecular structures of natural products, although seldom discussed explicitly, also play an important role by suggesting hypothetical pathways for testing. Structural analyses are also important in drug discovery, for which many molecular representation systems (methods of representing molecular structures in a computer-friendly format) have been developed. Unfortunately, pathway elucidation investigations seldom use these representation systems. This gap likely occurs because those systems are primarily built to document molecular connectivity and topology rather than the absolute positions of bonds and atoms in a common reference frame, which would enable chemical structures to be connected with potential underlying biosynthetic steps. Here, we expand on recently developed skeleton-based molecular representation systems by implementing a common-reference-frame-oriented system. We tested this system using triterpenoid structures as a case study and explored its applications in biosynthesis and structural diversity tasks. The common-reference-frame system can identify structural regions of high or low variability on the scale of atoms and bonds and enable hierarchical clustering that is closely connected to underlying biosynthesis. Combined with information on phylogenetic distribution, the system illuminates distinct sources of structural variability, such as different enzyme families operating in the same pathway. These characteristics outline the potential of common-reference-frame molecular representation systems to support large-scale pathway elucidation efforts.

Keywords: biosynthesis; molecular representation system; natural products.

Figures

Figure 1
A molecular representation system based on a grid-like template enables the identification of atom-to-atom correspondences. (A) Schematic representation of the atom-to-atom correspondence system. Triterpenoid molecules were aligned to a grid, and corresponding atoms were identified, enabling, for example, the identification of positions in the structures that were highly variable or highly conserved. (B) Example of two structures, cycloartenol and hederagenin, overlaid onto our grid template. The template used for all molecules is shown on the right. (C) Principal-component analysis comparing triterpenoid backbone structures in the TeroKit dataset. Each circle represents a unique skeleton found among TeroKit triterpenoid entries. Circle size corresponds to the number of triterpenoids with that skeleton in the TeroKit dataset, with the largest circle representing 13 434 compounds and the smallest circles representing instances for which there is only a single compound with that skeleton. Circles in yellow indicate scaffolds that are represented by the triterpenoids analyzed in this work, and gray circles indicate those not covered by our set of 112 triterpenoids. (D) The grid template with each atom position and bond position colored according to its index of quantitative variation. (E) The grid template with each atom position and bond position colored according to its variation ratio.
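
The per-position variation ratio mentioned for panel (E) can be illustrated with a minimal sketch. It assumes the standard definition of the variation ratio (one minus the share of the modal category) and a hypothetical encoding in which each molecule maps a grid position to the feature found there. The molecule names, position names, and feature labels below are invented for illustration and are not taken from the paper's dataset.

```python
from collections import Counter

# Hypothetical encoding (an assumption, not the authors' data format):
# each molecule maps a grid position to the feature observed there.
molecules = {
    "mol_a": {"atom_1": "C", "atom_2": "O", "bond_1": "single"},
    "mol_b": {"atom_1": "C", "atom_2": "C", "bond_1": "double"},
    "mol_c": {"atom_1": "C", "atom_2": "O", "bond_1": "none"},
    "mol_d": {"atom_1": "C", "atom_2": "none", "bond_1": "single"},
}


def variation_ratio(values):
    """Return 1 minus the relative frequency of the most common value."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)


def position_variation(mols):
    """Compute the variation ratio for every grid position across molecules."""
    positions = sorted({pos for features in mols.values() for pos in features})
    return {
        pos: variation_ratio([features.get(pos, "none") for features in mols.values()])
        for pos in positions
    }


ratios = position_variation(molecules)
for position, ratio in ratios.items():
    print(position, ratio)
```

A ratio of 0 marks a fully conserved position, and larger values mark positions where the molecules disagree more often. Because every molecule is described on the same template, the comparison is position by position rather than by graph matching.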
Figure 2
Hierarchical clustering of wax triterpenoid structures and structural variation within each major structural group. Clustering was performed with the ward.D method on a matrix of Gower distances (Gower, 1966) computed from a matrix of structural data for the triterpenoids. Four overall structural groups were identified; the general structure of each is shown in insets, with colors representing the variation ratio of each bond or atom position within the general structure. Branch widths represent support for the respective nodes, with wider branches indicating more support. Cations underlying the biosynthesis of triterpenoids are shown on the right, with the gray dotted lines included simply to guide the eye with regard to the hierarchy of cations. Dotted gray arrows represent further rearrangements to generate the final products shown on the dendrogram. Abbreviations are defined in Supplemental Table 1.
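
The clustering step described in this caption can be sketched in plain Python. The sketch makes two assumptions: that all structural features are categorical, in which case the Gower distance reduces to the share of mismatched positions, and that ward.D corresponds to applying the Ward Lance-Williams update to the distances as given, without squaring them. The feature vectors are invented, and the function names are hypothetical rather than taken from the authors' code.

```python
def gower_categorical(a, b):
    """Gower distance for purely categorical features: share of mismatches."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def ward_d_linkage(dist):
    """Agglomerative clustering using the Ward Lance-Williams update
    on unsquared distances (assumed here to mirror ward.D)."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (i, j), height = min(d.items(), key=lambda item: item[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = d[tuple(sorted((k, i)))]
            dkj = d[tuple(sorted((k, j)))]
            # Lance-Williams update for Ward's criterion.
            d[(k, next_id)] = (
                (ni + nk) * dki + (nj + nk) * dkj - nk * height
            ) / (ni + nj + nk)
        members = clusters.pop(i) + clusters.pop(j)
        # Drop distances that involve the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        clusters[next_id] = members
        merges.append((sorted(members), height))
        next_id += 1
    return merges


# Invented feature vectors: one categorical value per grid position.
features = [
    ["C", "O", "single", "none"],
    ["C", "O", "single", "double"],
    ["C", "C", "none", "none"],
    ["O", "C", "none", "none"],
]
dist = [[gower_categorical(a, b) for b in features] for a in features]
merges = ward_d_linkage(dist)
for members, height in merges:
    print(members, round(height, 3))
```

Each entry in `merges` lists the members of a newly formed cluster and the height at which it formed, which is the information a dendrogram draws. In practice an established implementation would be preferable to this hand-rolled loop.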
Figure 3
A common-reference-frame molecular representation system reveals major dimensions of structural variability in triterpenoids from plant surface waxes. (A) Multiple-correspondence analysis of triterpenoid structures. (B) Ordination output from the multiple-correspondence analysis. Each circle represents a bond or atom position in the grid system. (C) Ordination output mapped onto the template grid system. Gray represents bond or atom positions with low contribution to variation in respective dimensions, while a darker color (red for dimension 1 and blue for dimension 2) indicates a bond or atom position that contributes substantially to variation in that dimension.
Figure 4
Major triterpenoid structural groups and triterpenoid presence across a plant phylogeny. Our 112 triterpenoid structures from literature reports were mapped onto a pruned phylogeny derived from a previous report (Qian and Jin, 2015) on the basis of occurrence within specific plant species. Compound names are listed on the bottom horizontal axis, and each box in the plot represents an instance of that triterpenoid being reported from the cuticular wax mixture of a given plant species (listed on the left vertical axis). Triterpenoid molecules are organized into colored groups based on major structural groups identified by our analysis in Figure 2; the ursane/oleananes group is blue, the taraxane/friedelanes group is red, the lupane/hopanes group is green, and the protosteranes are purple. The frequency with which a triterpenoid compound was reported across all plant species is shown in the top bar chart, and the number of triterpenoid compounds reported in a given plant species is shown in the right bar chart.
Figure 5
Co-occurrence of triterpenoids as well as atoms and bonds across structurally distinct molecules. (A) Each dot represents a triterpenoid, colored according to structural group as shown in the legend. Triterpenoids are shown as pairs, plotted on the x axis according to dissimilarity (the proportion of atoms or bonds not shared by the pair) and on the y axis according to the difference in observed and expected overlap frequency across the phylogeny in Figure 4. (B) Network in which each node represents a molecular feature (a particular atom in a certain grid position or a bond with a specific orientation in a given grid position). Edges between nodes indicate atoms/bonds that co-occur frequently among molecules in the dataset. Clusters colored gray indicate groups of four or more atoms/bonds that co-occur with high frequency. The pie chart of each gray cluster indicates the overall structural groups in which those co-occurring atoms/bonds are found.
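
The two axes of panel (A) can be sketched as follows. The caption defines dissimilarity as the proportion of atoms or bonds not shared by a pair; the sketch assumes the denominator is the union of the two feature sets. It also assumes that expected overlap is the product of the two compounds' individual occurrence frequencies, which is one common independence baseline and may differ from the calculation used in the paper. All feature sets and species labels are invented.

```python
def dissimilarity(features_a, features_b):
    """Proportion of features (atoms/bonds at grid positions) not shared.
    Assumption: normalised by the union of both feature sets."""
    union = features_a | features_b
    return len(features_a ^ features_b) / len(union)


def overlap_excess(species_a, species_b, n_species):
    """Observed minus expected co-occurrence frequency across species.
    Assumption: expected overlap treats the two compounds as independent."""
    observed = len(species_a & species_b) / n_species
    expected = (len(species_a) / n_species) * (len(species_b) / n_species)
    return observed - expected


# Invented example: features are (grid position, label) pairs.
compound_a = {("atom_1", "C"), ("atom_2", "O"), ("bond_1", "single")}
compound_b = {("atom_1", "C"), ("atom_2", "C"), ("bond_1", "single")}

# Invented occurrence data: species in which each compound was reported.
occurs_a = {"sp1", "sp2", "sp3"}
occurs_b = {"sp2", "sp3"}

x = dissimilarity(compound_a, compound_b)
y = overlap_excess(occurs_a, occurs_b, n_species=4)
print(x, y)
```

A positive `y` means the pair is reported together more often than the independence baseline predicts, and a negative `y` means less often.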

