Molecules. 2023 Dec 21;29(1):52.
doi: 10.3390/molecules29010052.

Structural Outlier Detection and Zernike-Canterakis Moments for Molecular Surface Meshes-Fast Implementation in Python

Mateusz Banach. Molecules. 2023.

Abstract

Object retrieval systems measure the degree of shape similarity between 3D models and search a model database for the entries that resemble a query model. In structural bioinformatics, the query model is a protein tertiary or quaternary structure, and the objective is to find similarly shaped molecules in the Protein Data Bank (PDB). With the ever-growing size of the PDB, a direct atomic coordinate comparison with all its members is impractical. To overcome this problem, the shape of the molecules can be encoded by fixed-length feature vectors, so that the distance of a protein to the entire PDB can be measured in this low-dimensional domain in linear time. The state-of-the-art approaches use Zernike-Canterakis (ZC) moments for the shape encoding and supply the retrieval process with geometric data of the input structures. The BioZernike descriptors have been a standard utility of the PDB since 2020. However, calculating the ZC moments locally is hampered by a shortage of libraries that can be used directly in custom programs (i.e., without relying on external binaries), particularly programs written in Python. Here, a fast and well-documented Python implementation of the Pozo-Koehl (PK) algorithm is presented. In contrast to the more popular algorithm by Novotni and Klein, which is based on the voxelized volume, the PK algorithm produces ZC moments directly from the triangular surface meshes of 3D models. In particular, it can accept the molecular surfaces of proteins as its input. In the presented PK-Zernike library, owing to Numba's just-in-time compilation, a mesh with 50,000 facets is processed by a single thread in a second at moment order 20. Since this is the first time the PK algorithm has been used in structural bioinformatics, it is employed in a novel, simple, but efficient protein structure retrieval pipeline. The elimination of the outlying chain fragments via a fast PCA-based subroutine improves the discrimination ability, allowing this pipeline to achieve a 0.961 area under the ROC curve in the BioZernike validation suite (0.997 for the assemblies). The correlation between the results of the proposed approach and those of the 3D Surfer program reaches values of up to 0.99.
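
The idea of comparing fixed-length feature vectors in linear time can be illustrated with a minimal sketch. It is not the PK-Zernike library's API or the author's pipeline: the function name, the random stand-in data, the plain Euclidean distance, and the vector length of 121 are all assumptions made for illustration only.

```python
import numpy as np


def rank_by_descriptor_distance(query, database, top_k=5):
    """Return (indices, distances) of the top_k database rows nearest to query.

    Hypothetical helper: each row of `database` is one fixed-length shape
    descriptor, and Euclidean distance stands in for a shape distance.
    """
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    # A single pass over N rows of length D costs O(N * D).
    distances = np.linalg.norm(database - query, axis=1)
    k = min(top_k, len(distances))
    # argpartition picks the k smallest distances without a full sort.
    nearest = np.argpartition(distances, k - 1)[:k]
    # Only the k selected candidates are sorted.
    nearest = nearest[np.argsort(distances[nearest])]
    return nearest, distances[nearest]


# Stand-in data: 1000 random descriptors of an assumed length of 121.
rng = np.random.default_rng(0)
database = rng.random((1000, 121))
query = database[42] + 0.001 * rng.random(121)  # slightly perturbed copy of row 42
idx, dist = rank_by_descriptor_distance(query, database, top_k=3)
print(idx, dist)
```

Because every structure is reduced to a vector of the same length, the cost of a query grows with the number of database entries rather than with the number of atoms in each entry.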

Keywords: Numba; Python; Zernike moments; bioinformatics; computational geometry; molecular surface; principal component analysis; protein structure; shape retrieval.

Conflict of interest statement

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figures

Figure 1
Common volumetric representations of a protein structure: (a) molecular surface mesh of all atoms, (b) molecular surface mesh of backbone atoms, (c) voxel grid of all atoms, and (d) voxel grid of backbone atoms. The protein has the PDB code 1KFV (chain A). The mesh in (a) has 13,670 vertices, 27,284 facets, and a volume of 33,504 Å³. The mesh in (b) has 12,747 vertices, 25,578 facets, and a volume of 13,248 Å³. The grids in (c,d) have 24,785 and 11,588 unit voxels, respectively.
Figure 2
Time needed by the proposed and reference implementations of the PK algorithm to return the ZC descriptors of (a) 1SZT (5988 vertices, 11,944 facets) and (b) 1AVO (39,123 vertices, 78,043 facets). The surface meshes were calculated in the all-atom mode and decimated by 50%.
Figure 3
Flowchart of the proposed protein retrieval pipeline. This image was authored with draw.io v. 21.6.8 (https://github.com/jgraph/drawio-desktop, accessed on 31 August 2023).
Figure 4
Change in the unit ball scale factor and in the ZC descriptors caused by the elimination of the outlier residues in 4B0H:A (a–c) and 1IIE:A–C (d–f). The values in the legends in (a,b,d,e) are the scale factor and the numbers of outlier (blue diamonds) and guide (brown circles) residues. The range of the Y axis in (c,f) is limited for visibility; “max” is the value of the tallest descriptor.
Figure 5
The impact of adaptive (a–e) and uniform (f–j) decimation of the backbone atom molecular surface of 1SZT:A on its ZC descriptors. The legends in (e,j) denote the reduction factor, the volume of the decimated mesh, and the Δz versus the undecimated MSMS output, which is represented by the black lines in (e,j). The volume of the original mesh was 3275 Å³.
Figure 6
The impact of the unit ball scale factor on the ZC descriptors of 1DIV:A: (a) r_max and 2r_g, (b) r_max/0.7 and r_PCA. The corresponding balls before scaling down to the unit size are shown in (c). The distance histogram in (d) is independent of the scale factor. Its bin size is 2 Å.
Figure 7
ROC curves for the proposed protein structure retrieval pipeline with the highest sum of the Δzdsv AUROCs in the CATH (a–e), ECOD (f–j), and assembly (k–o) BioZernike test suites. The settings were as follows: backbone atom mesh, outlier detection on, and 2r_g as the unit ball scale factor. The legends denote the shape distance function, the AUROC, and the coordinates of the markers. The colors of those markers correspond to the upper bounds in Table 1 below which the structures were considered similar. A scalable version of this figure is available as Figure S28 in Supplemental File S2.
Figure 8
Δzdsv heatmap for the 30 inputs from Table 2 (435 pairs). The data are mirrored on both sides of the diagonal. Color scale of the shape distance: black [0,0.2), brown [0.2,0.4), orange [0.4,0.6), yellow [0.6,0.8), green [0.8,1.0), gray [1.0,2.0]; these are the same ranges as in Table 1. A scalable version of this figure is available as Figure S35 in Supplemental File S3.
Figure 9
Δzdsv heatmap for the 30 inputs from Table 2 (435 pairs). The results of the proposed protein structure retrieval pipeline are in the lower diagonal matrix. The results from the 3D Surfer program are in the upper diagonal matrix. See Figure 8 and Table 1 for the color scale. A scalable version of this figure is available as Figure S40 in Supplemental File S4.
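
The pair count and the color ranges quoted in the Figure 8 caption can be checked with a short sketch. The bin boundaries and color names are taken from the caption; the function and variable names are hypothetical and are not part of the published code.

```python
from itertools import combinations

# Half-open ranges from the Figure 8 caption: [low, high) -> color.
HALF_OPEN_BINS = [
    (0.0, 0.2, "black"),
    (0.2, 0.4, "brown"),
    (0.4, 0.6, "orange"),
    (0.6, 0.8, "yellow"),
    (0.8, 1.0, "green"),
]


def color_for_distance(distance):
    """Map a shape distance in [0, 2] to the caption's color name."""
    for low, high, color in HALF_OPEN_BINS:
        if low <= distance < high:
            return color
    # The last range in the caption, [1.0, 2.0], is closed on both ends.
    if 1.0 <= distance <= 2.0:
        return "gray"
    raise ValueError(f"distance {distance} is outside [0, 2]")


# 30 inputs give 30 * 29 / 2 unordered pairs, matching the caption.
n_pairs = len(list(combinations(range(30), 2)))
print(n_pairs)
print(color_for_distance(0.35))
```

Only the unordered pairs need to be computed; a heatmap mirrored across the diagonal repeats each of them once.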
