Curr Protein Pept Sci. 2010 Nov;11(7):573-88.

doi: 10.2174/138920310794109139.

Disease risk of missense mutations using structural inference from predicted function

Jeremy A Horst¹, Kai Wang, Orapin V Horst, Michael L Cunningham, Ram Samudrala

PMID: 20887259
PMCID: PMC3095817
DOI: 10.2174/138920310794109139

Jeremy A Horst et al. Curr Protein Pept Sci. 2010 Nov.

Affiliation

¹ Department of Microbiology, School of Medicine, University of Washington, 1959 NE Pacific St 357132, Seattle, WA 98195, USA.

Abstract

Advances in sequencing techniques place personalized genomic medicine on the horizon, bringing with them the responsibility of clinicians to understand the likelihood that a mutation causes disease, and of scientists to separate etiology from nonpathologic variability. Pathogenicity is discernible from patterns of interactions between a missense mutation, the surrounding protein structure, and intermolecular interactions. Physicochemical stability calculations are not accessible without structures, which are unavailable for the vast majority of human proteins, so diagnostic accuracy remains in its infancy. To model the effects of missense mutations on functional stability without structure, we combine novel protein sequence analysis algorithms to discern spatial distributions of sequence, evolutionary, and physicochemical conservation, through a new approach to optimizing component selection. Novel components include a combinatory substitution matrix and two heuristic algorithms that detect positions which confer structural support to interaction interfaces. The method reaches 0.91 AUC in ten-fold cross-validation when predicting alteration of function for 6,392 in vitro mutations. For clinical utility we trained the method on 7,022 disease-associated missense mutations within Online Mendelian Inheritance in Man (OMIM) amongst a larger randomized set. In a blinded prospective test to delineate mutations unique to 186 patients with craniosynostosis from those in the 95 highly variant Coriell controls and 1000 age-matched controls, we achieved roughly 1/3 sensitivity and perfect specificity. The component algorithms retained during machine learning constitute novel protein sequence analysis techniques to describe environments supporting neutrality or pathology of mutations. This approach to pathogenetics enables new insight into the mechanistic relationship of missense mutations to disease phenotypes in our patients.

Figures

**Fig. (1)**
New methods for prediction of mutational disruption. Methods assessing conservation (HMMRE, SSR), sequence-derived structural patterns (Shells, CloseSS), and a combinatory amino acid substitution matrix (Matrices) are newly applied to the problem of predicting functional disruption by artificial missense mutations in the standard *in vitro* mutation test set assembled by the creators of SIFT [48]. This set comprises *in vitro* assay results for 336 mutations in HIV protease [79], 2015 in bacteriophage T4 lysozyme [80], and 4044 in the *E. coli* Lac repressor [81]. Parameters are trained for Shells (AUC=69.1), Matrices (AUC=67.4), CloseSS (AUC=68.3), and AA type (AUC=61.9). **(A)** The receiver operating characteristic of the five algorithms in ten-fold cross-validation (* indicates novel algorithms). Each algorithm performs better than random (Reference line) in all cases, each between 1.4 and 2 times more accurate than considering amino acid type alone. HMMRE is most accurate (AUC=73.2) except in high-specificity cases, for which SSR (AUC=66.5) performs better. **(B)** Low correlation between the predictions of the different algorithms indicates that additive predictive ability can be achieved by combination; see Fig. (2).
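The AUC values quoted in these captions summarize a receiver operating characteristic as a single number. As a minimal sketch (not the authors' code), the AUC can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy scores and labels are illustrative assumptions; the captions report the same quantity scaled by 100.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive outscores a random
    negative; tied scores contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: 1 = function altered, 0 = no measurable effect.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)          # 8 of 9 positive/negative pairs are ordered correctly
uninformative = roc_auc([0.5] * 6, labels)  # constant scores give the random reference value
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value more efficiently on large sets.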

**Fig. (2)**
Additive prediction value of combining novel algorithms. The philosophical derivation of predictive algorithms demonstrates separable improvements from combining sequence-derived parameters of structure and function. Predicted structural features (Str from Seq, AUC=70.1) include disorder, secondary structure, solvation, contribution to disulfide bonds, and domain break points. Adding in predicted function from predicted structure (Fxn from Str, AUC=76.3) includes the Shells and CloseSS methods. Regression combination of the HMMRE and SSR conservation methods and amino acid type is synonymous with our approach to predicting residues with direct functional contribution measured by contacts with any interacting molecule [10], but here we instead consider functional contribution as positions for which mutations will disrupt protein function (MFS, AUC=76.9). Combining MFS with Str from Seq adds improvement (MFS Str, AUC=78.2). Sensitivity improves further on adding the Fxn from Str algorithms (MFS StrFxn, AUC=80.7), which use only data already present in MFS Str, suggesting that the model of the structural environment by this sequence-based algorithm is significant. Finally, including the substitution matrices in the regression increases predictive ability (MFS StrFxn Matrix, AUC=83.2).

**Fig. (3)**
Improved prediction using less information; a novel approach to machine learning. Machine learning techniques address the challenge of combining predictive scores from unique individual algorithms into a unified prediction. Previous approaches to data type selection in protein informatics either combine all available data or follow an expert's intuition. We demonstrate that employing a sample preparation technique to decrease information greatly improves the predictions of a more complex machine learning method. For the preparation technique we employ reverse stepwise logistic regression (Rev Step LogR), which removes data types but does not significantly alter prediction outcomes. The blue markers demonstrate extremely similar accuracy profiles of logistic regression before (AUC=83.22) and after (AUC=83.25) filtering out information types that contribute insignificantly with Rev Step LogR. The green line depicts the accuracy of support vector machine (SVM) training without Rev Step LogR filtration steps (AUC=83.8). The red line shows the exact same SVM method applied after filtration (AUC=90.6), demonstrating far better specificity and sensitivity than reached when including all data types. The Rev Step LogR SVM (referred to as HUSCY in later figures) depiction highlights that the approach is generally novel to bioinformatics, creating perhaps the first example of improved performance in internal cross-validation by avoiding overtraining.
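The filtering idea in this caption, dropping data types whose removal does not hurt the prediction before training the final classifier, can be illustrated with a generic backward elimination loop. This is a simplified sketch under stated assumptions, not the authors' Rev Step LogR procedure: the caption does not give the removal criterion, so the sketch substitutes a user-supplied score function and a tolerance. The names `backward_eliminate`, `score_fn`, `tolerance`, and the toy gains are all hypothetical.

```python
def backward_eliminate(features, score_fn, tolerance=0.0):
    """Greedy reverse stepwise selection.

    At each round, remove the single feature whose removal leaves the
    highest score, provided that score is no more than `tolerance` below
    the score of the current feature set. Stops when every removal costs
    more than that, or when one feature remains.
    """
    kept = list(features)
    current = score_fn(kept)
    while len(kept) > 1:
        # Score each subset obtained by dropping exactly one feature.
        trials = [(score_fn([f for f in kept if f != drop]), drop) for drop in kept]
        best_score, best_drop = max(trials, key=lambda t: t[0])
        if best_score < current - tolerance:
            break  # every single removal degrades the score too much
        kept.remove(best_drop)
        current = best_score
    return kept, current


# Toy stand-in for a cross-validated score: made-up additive gains per feature.
TOY_GAIN = {"conservation": 0.20, "shells": 0.08, "matrix": 0.05,
            "noise_a": 0.0, "noise_b": 0.0}


def toy_score(subset):
    return 0.5 + sum(TOY_GAIN[f] for f in subset)


kept, score = backward_eliminate(list(TOY_GAIN), toy_score, tolerance=1e-9)
```

In a real setting `score_fn` would be something like a cross-validated AUC of a logistic regression on the chosen columns, and the surviving features would then be passed to the more complex learner (an SVM in the caption).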

**Fig. (4)**
Comparison to other methods for missense mutation phenotype prediction. Comparison of performance on the standard *in vitro* dataset for HUSCY (Rev Step LogR SVM in Fig. 3) to approaches previously published in the field: SNAP [36], SIFT [48], PolyPhen [88], PMUT [55], and MAPP [49]. PMUT was designed to predict human mutations, not the microbial systems assessed here. Other methods including SIFT, SNAP, PMUT, and PolyPhen were not trained on this specific data set, and thus would not be anticipated to perform as well. Nonetheless, the set is used to standardize comparison of the methods. Performance for PolyPhen is taken from the SNAP paper. **(A)** ROC accuracy profiles. **(B)** Two-state accuracy separated by the three protein reporter systems comprising the standard set. The HUSCY and SNAP methods perform stably across the three proteins. These data demonstrate a contribution to the field of characterizing mechanisms of protein function on the stringent test of picking out single mutations that produce any experimentally measurable change in the assayed functions.

**Fig. (5)**
Accuracy profile for prediction of deleterious effects by mutations in Lac repressor. The two-state prediction accuracy of HUSCY (left) and SIFT (right) [48] for all 12 or 13 mutations at each position (of 4044 in *E. coli* Lac repressor) [81], mapped onto the homodimer structure of Lac repressor bound to the operator DNA (PDB ID 1lbg). Side chains built by SCWRL4 [96] are shown for all 328 residues for which mutations were made, colored as a heat map from blue for perfect selection to red for no correct selections. Main chains are colored to differentiate the homodimer chains. DNA is shown as simplified ellipsoids in a 5′-3′ rainbow map. Two-state accuracy includes correct prediction of either deleterious effects or no effects. Residues with <50% accuracy by HUSCY are shown as ball and stick in both renderings. The residues for which HUSCY displays poor performance are clustered at the protein homodimer interface and the allolactose binding site. Future improvements are directed by this analysis to include terms for interaction interface prediction. Clearly we already achieve our goal of accurate prediction for the interface support residues, bringing forward the field of sequence-based prediction of destabilizing mutations.
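The per-residue coloring described in this caption amounts to grouping predictions by sequence position and computing the fraction that match the assay, counting correct "deleterious" and correct "no effect" calls alike. A minimal sketch follows; the record layout, the function names, and the example data are assumptions made for illustration, and only the 50% cutoff comes from the caption.

```python
from collections import defaultdict


def per_position_accuracy(records):
    """Two-state accuracy per position.

    `records` is an iterable of (position, predicted_deleterious,
    observed_deleterious) tuples, one per tested mutation.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for position, predicted, observed in records:
        total[position] += 1
        if predicted == observed:  # true positives and true negatives both count
            correct[position] += 1
    return {pos: correct[pos] / total[pos] for pos in total}


def poorly_predicted(accuracy_by_position, threshold=0.5):
    """Positions whose accuracy falls below the threshold, in sorted order."""
    return sorted(pos for pos, acc in accuracy_by_position.items() if acc < threshold)


# Made-up example: three positions with a handful of mutations each.
records = [
    (10, True, True), (10, False, False), (10, True, False),
    (11, True, False), (11, False, True), (11, False, False),
    (12, False, False), (12, False, False),
]
accuracy = per_position_accuracy(records)
flagged = poorly_predicted(accuracy)
```

The resulting dictionary could be written out as per-residue values for a structure viewer to color by, with the flagged positions rendered separately.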

**Fig. (6)**
Prediction of disease-related nonsynonymous SNPs in OMIM. **(A)** Receiver operating characteristic for selection of the 7,022 nonsynonymous SNPs recognized by Online Mendelian Inheritance in Man (OMIM) [84] as contributory to human disease versus a negative control set of 31,698 randomly generated nonsynonymous SNPs that we created to match the distribution of occurrence to all those observed in patients (PMD human *in vivo* subset) [85]. Predictive ability is gained from training the combination on this data set in ten-fold cross-validation. It might be surprising from the figure that HUSCY reaches a two-state accuracy of 85% (98.5% specificity, 17.5% sensitivity; AUC=67.7), but there are 4.5 times more neutral instances than deleterious cases. This prediction value of 70% above random for clinical data has not been achieved previously. Rev Step LogR results in consistent selection of parameters across the ten derivations, which suggests stability of the algorithm (AUC=71.9). The HUSCY method trained on the standard *in vitro* set (InVitro, AUC=58.7) does not perform as well as simply considering the amino acid type (AUC=61.9), which highlights the difference between these problems (discussed in Results). Further disclosures for this data set include: passive nonsynonymous SNPs in humans are not yet known and so are modeled here, so many of the instances taken as negative would actually affect function and are in truth positives; many of the positive instances have not been thoroughly evaluated, e.g. in multiple prospectively studied populations. **(B)** We trained a specific amino acid substitution scoring matrix to select disease-related nonsynonymous SNPs (gray circles in (A), AUC=67.8) as a combination of those in the AAindex database [35] which do not require other features such as secondary structure or solvation. The matrix demonstrates marginally higher accuracy than a sophisticated conservation measure trained for this purpose (MFS; green line in (A), AUC=67.0).
Higher values are predictive of disease relation. Coloring is presented as a heat map, with red representing stronger predictions of disease and green representing minimal chance of causing disease. Only the mutations possible from a single nucleotide change are shown (i.e. nonsynonymous SNPs). The matrix values converge to two significant figures across the ten cross-validation training sets. This matrix can be applied instantaneously as a simple lookup table by clinicians not familiar with protein informatics.
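The remark in panel (A) that high two-state accuracy coexists with low sensitivity follows from class imbalance, which a short calculation makes concrete. The sketch below uses only the counts and the rounded sensitivity and specificity stated in the caption; the function name is an assumption. Plugging in those rounded values gives roughly 84%, close to the reported 85%; the small gap may come from rounding or from how results were averaged over folds, which the caption does not state.

```python
def two_state_accuracy(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy as the class-size-weighted mix of sensitivity and specificity."""
    true_pos = sensitivity * n_positive
    true_neg = specificity * n_negative
    return (true_pos + true_neg) / (n_positive + n_negative)


N_DISEASE = 7022    # OMIM disease-associated SNPs (positives)
N_NEUTRAL = 31698   # randomly generated control SNPs (negatives)

# Values as rounded in the caption: 17.5% sensitivity, 98.5% specificity.
reported = two_state_accuracy(0.175, 0.985, N_DISEASE, N_NEUTRAL)

# Baseline that calls every variant neutral: zero sensitivity, perfect specificity.
all_neutral = two_state_accuracy(0.0, 1.0, N_DISEASE, N_NEUTRAL)

# Imbalance quoted in the caption as "4.5 times more neutral instances".
ratio = N_NEUTRAL / N_DISEASE

print(f"accuracy from caption values: {reported:.3f}")
print(f"always-neutral baseline:      {all_neutral:.3f}")
print(f"neutral : disease ratio:      {ratio:.2f}")
```

Because the always-neutral baseline already scores about 82% on this set, accuracy alone says little here, which is why the caption also reports the AUC.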

Cited by

Exome sequencing identifies a recurrent de novo ZSWIM6 mutation associated with acromelic frontonasal dysostosis.
Smith JD, Hing AV, Clarke CM, Johnson NM, Perez FA, Park SS, Horst JA, Mecham B, Maves L, Nickerson DA; University of Washington Center for Mendelian Genomics; Cunningham ML. Am J Hum Genet. 2014 Aug 7;95(2):235-40. doi: 10.1016/j.ajhg.2014.07.008. PMID: 25105228. Free PMC article.

References

1. Wang Z, Moult J. SNPs, Protein Structure, and Disease. Hum Mutat. 2001;17(4):263–270.
2. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. J Mol Biol. 2002;320(2):369–387.
3. Yue P, Li Z, Moult J. Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol. 2005;353(2):459–473.
4. Pakula AA, Sauer RT. Genetic analysis of protein stability and function. Annu Rev Genet. 1989;23:289–310.
5. Allali-Hassani A, Wasney GA, Chau I, Hong BS, Senisterra G, Loppnau P, Shi Z, Moult J, Edwards AM, Arrowsmith CH, Park HW, Schapira M, Vedadi M. A survey of proteins encoded by non-synonymous single nucleotide polymorphisms reveals a significant fraction with altered stability and activity. Biochem J. 2009;424(1):15–26.

Grants and funding

DP1 LM011509/LM/NLM NIH HHS/United States
