Gigascience. 2022 Sep 20;11:giac086.
doi: 10.1093/gigascience/giac086.

d-StructMAn: Containerized structural annotation on the scale from genetic variants to whole proteomes

Alexander Gress et al. Gigascience. 2022.

Abstract

Background: Structural annotation of genetic variants in the context of intermolecular interactions and protein stability can shed light onto mechanisms of disease-related phenotypes. Three-dimensional structures of related proteins in complexes with other proteins, nucleic acids, or ligands enrich such functional interpretation, since intermolecular interactions are well conserved in evolution.

Results: We present d-StructMAn, a novel computational method that enables structural annotation of local genetic variants, such as single-nucleotide variants and in-frame indels, and implement it as a highly efficient and user-friendly tool provided as a Docker container. Using d-StructMAn, we annotated several very large sets of human genetic variants, including all variants from ClinVar and all amino acid positions in the human proteome. We were able to provide annotation for more than 46% of positions in the human proteome, representing over 60% of proteins.

Conclusions: d-StructMAn is the first of its kind and a highly efficient tool for structural annotation of protein-coding genetic variation in the context of observed and potential intermolecular interactions. d-StructMAn is readily applicable to proteome-scale datasets and can be instrumental in building machine-learning tools for predicting genotype-to-phenotype relationships.

Keywords: Docker container; Single-nucleotide variants; genetic variation; indels; protein interactions; protein structure; structural annotation.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1:
Proportion of proteins and positions from the human proteome dataset that could be mapped to structure data. Experimentally resolved structure (green) denotes that the protein (position) was mapped to at least 1 structure with sequence identity ≥0.99, and structure of a homolog (blue) denotes that the protein (position) was mapped to at least 1 structure with sequence identity in the range from 0.35 to 0.99. Modeled structure (purple) denotes that the protein (position) could only be mapped to a modeled structure that is not directly supported by experimental data. Disordered (gray) denotes proteins and positions that could not be mapped to any structure but are predicted by IUPred3 [34] to be disordered (for proteins, all positions have to be predicted to be disordered). No structure (red) denotes all other proteins and positions.
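The five categories in this caption can be read as a priority cascade. The following is a minimal sketch of that reading, not the d-StructMAn implementation: the function name, its parameters, the precedence order, and the handling of identities below 0.35 (treated here as no experimental mapping) are all assumptions made for illustration.

```python
from typing import Optional


def mapping_category(best_experimental_identity: Optional[float],
                     has_model: bool,
                     predicted_disordered: bool) -> str:
    """Assign a Figure 1 category to a protein or position.

    best_experimental_identity: highest sequence identity to any
        experimentally resolved structure, or None if there is none
        (hypothetical input, not a d-StructMAn field name).
    has_model: True if a modeled structure is available.
    predicted_disordered: True if a disorder predictor flags the position.
    """
    if best_experimental_identity is not None:
        if best_experimental_identity >= 0.99:
            return "experimentally resolved structure"
        if best_experimental_identity >= 0.35:
            return "structure of a homolog"
        # Assumption: identities below 0.35 do not count as a mapping.
    if has_model:
        return "modeled structure"
    if predicted_disordered:
        return "disordered"
    return "no structure"
```

For example, a position whose best experimental hit has identity 0.62 falls into the homolog category regardless of whether a model exists.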
Figure 2:
Distribution of the number of structures that could be mapped to a position in the annotation of the human proteome dataset. Each subplot shows the same distribution with a different zoom.
Figure 3:
Distribution of the proportion of positions per protein that could be mapped to a structure in the annotation of the human proteome dataset. Green: fraction of positions annotated using multiple structures found by StructMAn; blue: fraction of positions annotated using only 1 structure per protein (the structure with the highest sequence similarity was used; in case of equal sequence similarity, higher alignment coverage was preferred); red: fraction of positions annotated without considering structures of homologs.
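The single-structure baseline (blue) described in this caption amounts to a two-key ranking. A minimal sketch follows, assuming each candidate structure carries a sequence similarity and an alignment coverage score; the `StructureHit` type and its field names are hypothetical and not taken from StructMAn.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StructureHit:
    """Hypothetical record for one candidate structure of a protein."""
    structure_id: str
    sequence_similarity: float
    alignment_coverage: float


def pick_single_structure(hits: Sequence[StructureHit]) -> Optional[StructureHit]:
    """Return the hit with the highest sequence similarity.

    Ties in similarity are broken by higher alignment coverage, as the
    caption describes. Returns None when there are no hits.
    """
    if not hits:
        return None
    # Tuples compare element by element, so coverage only matters on ties.
    return max(hits, key=lambda h: (h.sequence_similarity, h.alignment_coverage))
```

Comparing on a tuple keeps the tie-break rule in one place; how StructMAn itself orders candidates beyond these two criteria is not stated here.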
Figure 4:
Each stacked barplot denotes the distribution of structural classifications for a dataset. Only positions that could be mapped to at least 1 structure are considered for this figure. Protein interaction: amino acids that are part of a protein–protein interaction interface; Misc interaction: amino acids that are part of an interaction with a nonprotein partner (DNA, for example); Core: amino acids in the core of the protein; Surface: amino acids classified to have access to solvent (and not involved in interactions).
Figure 5:
Violin plots for 4 example features. Left and right plots display the distribution of feature values for benign and pathogenic variants in ClinVar, respectively. (A) Only DelIns (multiresidue substitutions): relative surface area (RSA) value for chain atoms in the structures used for annotation of the wild-type protein. (B) Same as A, but for the structures used for annotation of the mutant protein. (C) Only deletions: the number of spatial interactions to other amino acids in the same polypeptide chain that are separated by more than 6 residues in the sequence. (D) Only insertions: median solvent access of residues from other proteins (co-crystallized structures) that lie in a 10 Å sphere around the annotated residue.
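The feature in panel C can be illustrated with a small counting routine. This is a sketch under stated assumptions: the intrachain contact list is taken as given (how spatial interactions are detected is not described here), residues are identified by integer sequence positions, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def count_long_range_contacts(residue_index: int,
                              contacts: Iterable[Tuple[int, int]],
                              min_separation: int = 6) -> int:
    """Count contacts of one residue with same-chain residues that are
    more than `min_separation` positions away in the sequence.

    contacts: pairs of sequence positions that are in spatial contact
        within one polypeptide chain (assumed to be precomputed).
    """
    count = 0
    for i, j in contacts:
        if residue_index not in (i, j):
            continue  # this contact does not involve the residue of interest
        partner = j if i == residue_index else i
        if abs(partner - residue_index) > min_separation:
            count += 1
    return count
```

With the default threshold, a partner exactly 6 positions away is not counted, matching the "more than 6 residues" wording.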
Figure 6:
Scatterplot showing runtime performance of StructMAn using different systems and different configurations. Red markers denote a normal desktop computer and blue markers denote a high-performance computing server. Different marker shapes denote different configurations of d-StructMAn.
Figure 7:
Schematic of the computational pipeline of StructMAn. Green boxes are computational sections, red boxes are data structures, and blue boxes are data sources.
Figure 8:
Structural classes are assigned by a decision tree based on the results from the annotation aggregation. The classification aims to describe the functional role of an amino acid residue in the protein structure.
Figure 9:
The results aggregation for indels is based on the position-specific results aggregation of the wild-type (WT) protein sequence and the results aggregation of the mutant (MUT) protein sequence. For both protein variants, 3 separate aggregations are performed: left flank of indel, indel region, and right flank of indel. Note that for an insertion, the length of the indel region in the WT is zero, and hence only the flanks produce feature lists (vice versa for deletions and MUT).
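The three-way split described in this caption can be sketched as index arithmetic on a sequence. The code below is illustrative only: the function name, the 0-based indexing, and the `flank_size` parameter are assumptions, since the flank length used by StructMAn is not given here.

```python
from typing import Dict, List


def split_indel_regions(seq_len: int,
                        indel_start: int,
                        indel_len: int,
                        flank_size: int = 5) -> Dict[str, List[int]]:
    """Return 0-based positions of the left flank, indel region, and
    right flank for one protein variant (WT or MUT).

    indel_len is the number of residues the indel occupies in THIS
    sequence: zero in the WT for an insertion, zero in the MUT for a
    deletion. flank_size is a hypothetical parameter.
    """
    region_end = indel_start + indel_len
    return {
        "left_flank": list(range(max(0, indel_start - flank_size), indel_start)),
        "indel_region": list(range(indel_start, region_end)),
        "right_flank": list(range(region_end, min(seq_len, region_end + flank_size))),
    }


# Example: 3 residues inserted before WT position 10 of a 20-residue protein.
wt = split_indel_regions(seq_len=20, indel_start=10, indel_len=0)
mut = split_indel_regions(seq_len=23, indel_start=10, indel_len=3)
```

In this example the WT indel region is empty, so only its flanks would contribute feature lists, while the MUT contributes all three regions.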

References

    1. 1000 Genomes Project Consortium. A map of human genome variation from population scale sequencing. Nature. 2010;467(7319):1061–73.
    2. Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017;18(10):599–612.
    3. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012;148(6):1293–307.
    4. Amoah K, Hsiao YHE, Bahn JH, et al. Allele-specific alternative splicing and its functional genetic variants in human tissues. Genome Res. 2021;31(3):359–71.
    5. Chong J, Buckingham K, Jhangiani S, et al. The genetic basis of mendelian phenotypes: discoveries, challenges, and opportunities. Am J Hum Genet. 2015;97(2):199–215.

Publication types