BMC Bioinformatics. 2014 Apr 17;15:111.
doi: 10.1186/1471-2105-15-111.

A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i)

Carlos Bermejo-Das-Neves et al. BMC Bioinformatics. 2014.

Abstract

Background: Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides leading to the insertion/deletion of one or more amino acids.
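The frameshifting versus non-frameshifting distinction described above comes down to whether the number of inserted or deleted nucleotides is a multiple of three. The following is a minimal sketch of that check, assuming the variant is given as reference and alternate allele strings inside a coding region; the function names are hypothetical and are not part of KD4i.

```python
def classify_coding_indel(ref: str, alt: str) -> str:
    """Return 'NFS' or 'FS' for a coding-region indel.

    Assumption: ref and alt are nucleotide strings for the same locus,
    and the variant lies entirely within a coding region.
    """
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("not an indel: alleles have equal length")
    # A change of 3, 6, 9, ... nucleotides preserves the reading frame.
    return "NFS" if length_change % 3 == 0 else "FS"


def amino_acids_changed(ref: str, alt: str) -> int:
    """Net number of amino acids inserted or deleted by an NFS-Indel."""
    if classify_coding_indel(ref, alt) != "NFS":
        raise ValueError("only defined for non-frameshifting indels")
    return abs(len(alt) - len(ref)) // 3


if __name__ == "__main__":
    print(classify_coding_indel("ACGT", "A"))   # deletion of 3 nucleotides
    print(classify_coding_indel("AC", "A"))     # deletion of 1 nucleotide
    print(amino_acids_changed("A", "ACGTACG"))  # insertion of 6 nucleotides
```

The sketch only counts the net change in residues. An NFS-Indel that does not fall on a codon boundary can also alter a flanking amino acid, which this check does not model.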

Results: In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels.

Conclusions: We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.

Figures

Figure 1
Comparison of conservation parameters for disease-causing and neutral NFS-Indels. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 2
Comparison of functional site parameters for disease-causing and neutral NFS-Indels. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 3
Comparison of amino acid volumes for disease-causing and neutral NFS-Indels. Top: volumes of the amino acids in the NFS-Indel. Middle and bottom: local perturbation of amino acid volumes caused by the NFS-Indel. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 4
Comparison of hydrophobicity for disease-causing and neutral NFS-Indels. Top: hydrophobicity in the NFS-Indel. Middle and bottom: local perturbation of hydrophobicity caused by the NFS-Indel. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 5
Comparison of polarity for disease-causing and neutral NFS-Indels. Top: polarity in the NFS-Indel. Middle and bottom: local perturbation of polarity caused by the NFS-Indel. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 6
Comparison of charge for disease-causing and neutral NFS-Indels. Top: charge in the NFS-Indel. Middle and bottom: local perturbation of charge caused by the NFS-Indel. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 7
Comparison of structural parameters for disease-causing and neutral NFS-Indels. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 8
Comparison of other parameters for disease-causing and neutral NFS-Indels. P-value for chi-squared test, where the null hypothesis is that there is no significant difference between the values of the parameter for deleterious and neutral variants.
Figure 9
Precision of rules in the final selected rule set calculated for the whole data set and for the test set.
Figure 10
Correlation of the percentage of KD4i deleterious predictions with the allele frequencies of NFS-indels found in the 1000 Genome Project.
Figure 11
Histogram of occurrence of the different parameters in the rules that cover the kinesin heavy chain isoform 5A (Uniprot ID: Q12840) N256 deletion.
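
The captions for Figures 1 to 8 report chi-squared P-values comparing a parameter between deleterious and neutral variants. As a rough illustration of how such a value can be obtained, the sketch below computes the Pearson chi-squared statistic and P-value for a 2x2 contingency table. It assumes a binary parameter (present or absent) and applies no continuity correction; the counts are made up for illustration and are not taken from the paper, and the function name is hypothetical.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared test for a 2x2 table, 1 degree of freedom.

    Table layout (assumed for this sketch):
                     parameter present   parameter absent
        deleterious         a                   b
        neutral             c                   d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    stat, p = chi_squared_2x2(30, 10, 10, 30)
    print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

A small P-value rejects the null hypothesis stated in the captions, namely that the parameter takes the same values for deleterious and neutral variants. Parameters with more than two categories would need the general r x c form of the test.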
