PLoS Comput Biol. 2013;9(12):e1003382.

doi: 10.1371/journal.pcbi.1003382. Epub 2013 Dec 12.

Amino acid changes in disease-associated variants differ radically from variants observed in the 1000 genomes project dataset

Tjaart A P de Beer¹, Roman A Laskowski¹, Sarah L Parks¹, Botond Sipos¹, Nick Goldman¹, Janet M Thornton¹

Affiliation

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genomes Campus, Cambridge, Cambridgeshire, United Kingdom.

PMID: 24348229
PMCID: PMC3861039
DOI: 10.1371/journal.pcbi.1003382

Tjaart A P de Beer et al. PLoS Comput Biol. 2013.

Abstract

The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.
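The asymmetry described in the abstract follows from counting each exchange in its known direction. The following is a minimal Python sketch under stated assumptions, not the authors' pipeline: the variant list and background counts are hypothetical toy data, and `relative_mutability` uses an assumed definition (mutations away from a residue divided by how often that residue occurs), which may differ from the mutability score defined in the paper's methods.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def exchange_matrix(variants):
    """Count directed exchanges as matrix[wild_type][mutant]."""
    matrix = {a: {b: 0 for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for wild_type, mutant in variants:
        if wild_type != mutant:  # ignore synonymous entries
            matrix[wild_type][mutant] += 1
    return matrix


def asymmetry(matrix, a, b):
    """Count of a->b minus count of b->a (0 when the pair is symmetric)."""
    return matrix[a][b] - matrix[b][a]


def relative_mutability(matrix, background_counts):
    """Mutations away from each residue per occurrence (assumed definition)."""
    result = {}
    for a in AMINO_ACIDS:
        n = background_counts.get(a, 0)
        result[a] = sum(matrix[a].values()) / n if n else 0.0
    return result


# Hypothetical toy data: (wild type, mutant) pairs and residue occurrence counts.
toy_variants = [("R", "H"), ("R", "H"), ("R", "C"), ("H", "R"), ("A", "V")]
toy_background = {"R": 10, "H": 5, "A": 20}

m = exchange_matrix(toy_variants)
mutability = relative_mutability(m, toy_background)
```

Because direction is kept, `m["R"]["H"]` and `m["H"]["R"]` are stored separately, so a symmetric matrix would only arise if both directions happened to be observed equally often.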

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. The amino acid exchanges observed in human protein variants.**
The 1*. Amino acids are arranged by one-letter code in order of increasing hydrophobicity (least hydrophobic on the left, most hydrophobic on the right) using the Fauchère and Pliska scale. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and OMIM. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations than OMIM mutations. Orange blocks show where there are proportionally more OMIM mutations than 1 kG mutations. The mutability scores (see methods) for the 1 kG and OMIM sets are shown in the last column. *Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the OMIM data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease.

**Figure 2. Comparison of the number of mutating residues vs the amino acid frequency of occurrence.**

**Figure 3. Amino acid mutability vs the number of codons in the 1 kG data.**

**Figure 4. A visual representation of the asymmetry of the 1 kG data.**
The plot shows the difference between how often an amino acid mutates and how often it is mutated to. These are raw counts and also reflect the frequency of occurrence. Each amino acid is coloured according to CpG content. Red: a CpG dinucleotide occurs in its codons; yellow: one of its codons starts with a G (with a C possibly preceding it); blue: no CpG in its codons. The black line indicates the diagonal where ‘mutations to’ equals ‘mutations from’.
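The colour rule in this caption can be derived directly from the standard genetic code. The sketch below is one reading of that rule, not the authors' code: it assumes the precedence red, then yellow, then blue, and the names `CODON_TABLE` and `cpg_class` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AA_STRING)
}


def cpg_class(amino_acid):
    """Classify an amino acid by the CpG content of its codons (assumed precedence)."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if any("CG" in c for c in codons):
        return "red"  # a CpG lies within at least one codon
    if any(c.startswith("G") for c in codons):
        return "yellow"  # a CpG is possible if the preceding base is C
    return "blue"  # no CpG within any codon, and no codon starts with G


classes = {aa: cpg_class(aa) for aa in sorted(set(AA_STRING) - {"*"})}
red = sorted(aa for aa, c in classes.items() if c == "red")
yellow = sorted(aa for aa, c in classes.items() if c == "yellow")
blue = sorted(aa for aa, c in classes.items() if c == "blue")
```

Under this reading, arginine falls in the red class because four of its six codons (CGN) begin with a CpG, which is consistent with the abstract's remark about its high mutability.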

**Figure 5. Site properties for all residues, 1 kG nsSNPs, OMIM nsSNPs and Humsavar nsSNPs in the structure 3D set.**
(A) The solvent accessibility of the variants in the four datasets, (B) the secondary structure in which each variant occurs and (C) the functional annotation of every variant in the four datasets.

**Figure 6. Comparison of the conservation scores in the four sets used.**
The density distribution of residue conservation scores for all the amino acid positions in UniProt (9,532,474 residues, black), 1 kG (185,428 residues, blue), OMIM (8,099 residues, red) and Humsavar (21,446 residues, green). The conservation scores range from 0 for non-conserved residues to 1 for highly conserved residues.

**Figure 7. Comparison of the differences in observed mutations in the various sets.**
Differences in the percentage of observed mutations in the 1 kG (blue) and OMIM (red) sets for one amino acid mutating to all others. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid; for example, proportionally more mutations from Lys to Glu are recorded in OMIM than in the 1 kG set.

**Figure 8. Comparison between the physicochemical properties of the wildtype and the mutant models for each of the data sets.**
Plots showing the differences between (A) Modeller DOPE scores for the wild type and mutant model (based on 3D, 10,628 mutations, and Humsavar sets, 21,446 residues), (B) changes in hydrophobicity between wild type and mutant in both sets and (C) changes in size between wild type and mutant in both sets.

**Figure 9. Bubble plots comparing the relative differences between the instantaneous rate change matrices of the data sets.**
(A) 1 kG data, (B) PAM matrix and (C) WAG matrix. (D) A PCA (first two components) plot showing the separation of the 1 kG matrices from other matrices. Matrices included are 1 kG (with and without assuming direction), nuclear (WAG, JTT, LG, PAM, tm126, PCMA), mitochondrial (mtREV24, mtMam, mtArt, mtZoa), chloroplast (cpREV, cpREV64), exposed (alpha helix, beta sheet, coil, turn) and buried (alpha helix, beta sheet, coil, turn). Principal components one and two represent 34% and 20% of the variance, respectively. All other principal components represent 9% or less of the variance each. Amino acids are arranged according to increasing hydrophobicity.

**Figure 10. Dependence of mutation rates on the change in CpG status.**
Rates of change from codons were calculated similarly to the amino acid rate matrix, but on a 61 by 61 codon matrix.

**Figure 11. Amino acid mutability rank order plot comparing the mutability scores for 1 kG, OMIM and Humsavar residues.**
The most mutable amino acids are at the top. Correlation coefficients for 1 kG vs OMIM, 1 kG vs Humsavar and OMIM vs Humsavar are 0.09, 0.17 and 0.51, respectively.
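The caption does not say which correlation coefficient was used. As an illustration only, the sketch below assumes a rank (Spearman) correlation, computed as the Pearson correlation of the ranks; the function names are hypothetical and no values from the paper are reproduced.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Given two lists of mutability scores in the same amino acid order, `spearman(scores_a, scores_b)` returns a value near 1 when the rank orders agree and near 0 when they are unrelated.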

Cited by

Exploring Novel Variants of the Cytochrome P450 Reductase Gene (POR) from the Genome Aggregation Database by Integrating Bioinformatic Tools and Functional Assays.
Rojas Velazquez MN, Therkelsen S, Pandey AV. Biomolecules. 2023 Nov 30;13(12):1728. doi: 10.3390/biom13121728. PMID: 38136599
Types and effects of protein variations.
Vihinen M. Hum Genet. 2015 Apr;134(4):405-21. doi: 10.1007/s00439-015-1529-6. Epub 2015 Jan 24. PMID: 25616435
Compensatory epistasis explored by molecular dynamics simulations.
Serrano C, Teixeira CSS, Cooper DN, Carneiro J, Lopes-Marques M, Stenson PD, Amorim A, Prata MJ, Sousa SF, Azevedo L. Hum Genet. 2021 Sep;140(9):1329-1342. doi: 10.1007/s00439-021-02307-x. Epub 2021 Jun 26. PMID: 34173867
VarMeter2: An enhanced structure-based method for predicting pathogenic missense variants through Mahalanobis distance.
Ohno S, Ogura C, Yabuki A, Itoh K, Manabe N, Angata K, Togayachi A, Aoki-Kinoshita K, Furukawa JI, Inamori KI, Inokuchi JI, Kaname T, Nishihara S, Yamaguchi Y. Comput Struct Biotechnol J. 2025 Mar 1;27:1034-1047. doi: 10.1016/j.csbj.2025.02.008. eCollection 2025. PMID: 40160862
Insights into Disease-Associated Mutations in the Human Proteome through Protein Structural Analysis.
Gao M, Zhou H, Skolnick J. Structure. 2015 Jul 7;23(7):1362-9. doi: 10.1016/j.str.2015.03.028. Epub 2015 May 28. PMID: 26027735
