Principal component analysis: a method for determining the essential dynamics of proteins

Charles C David et al. Methods Mol Biol. 2014;1084:193-226. doi: 10.1007/978-1-62703-658-0_11.
Abstract

It has become commonplace to employ principal component analysis (PCA) to reveal the most important motions in proteins. While most popular molecular dynamics packages provide PCA tools for analyzing protein trajectories, researchers often draw conclusions from the results without knowing how to interpret them, and they are frequently unaware of the limitations and generalizations of such analyses. Here we review best practices for applying standard PCA, describe useful variants, discuss why one may wish to perform comparison studies, and describe a set of metrics that make such comparisons possible. In practice, one is forced to make inferences about the essential dynamics of a protein without the desired number of samples; therefore, considerable time is spent describing how to judge the significance of results and highlighting pitfalls. The topic of PCA is reviewed from the perspective of many practical considerations, and useful recipes are provided.
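
To make the workflow concrete, here is a minimal sketch of standard PCA applied to a protein trajectory, assuming coordinates that are already superposed on a common reference and stored as a NumPy array of shape (frames, 3N); the function name and the synthetic data are illustrative, not taken from the chapter.

import numpy as np

def pca_of_trajectory(coords):
    """Standard PCA of a pre-aligned trajectory.

    coords : (n_frames, 3N) array of Cartesian C-alpha coordinates,
             already superposed on a common reference frame.
    Returns eigenvalues (descending, in A^2 of positional variance)
    and the corresponding eigenvectors (columns), i.e. the PCA modes.
    """
    centered = coords - coords.mean(axis=0)       # displacement vectors
    cov = centered.T @ centered / (len(coords) - 1)
    evals, evecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(evals)[::-1]               # sort descending
    return evals[order], evecs[:, order]

# Example with synthetic data: 2,000 frames of a 75-residue protein (225 DOF)
rng = np.random.default_rng(0)
traj = rng.normal(size=(2000, 225))
evals, evecs = pca_of_trajectory(traj)
print("variance captured by the first 10 modes:",
      evals[:10].sum() / evals.sum())

The eigenvalue spectrum returned here is what a scree plot (see Fig. 1) displays mode by mode.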


Figures

Fig. 1
(a) Eigenvalue scree plot for the first 100 modes of two example protein simulations (primary y-axis) and a random process (secondary y-axis), each having 225 dimensions. The units are angstroms squared (positional variance). (b) Average RMSIP scores for a random process in different vector-space dimensions as a function of subspace dimension. Error bars show plus and minus one standard deviation
Fig. 2
(a) RMSIP scores for inter-comparisons between three proteins, each having 75 residues, and a random process with 225 DOF. Only the true self-comparison yields a curve that saturates rapidly within a small essential space defined by the first nine modes. The decoy plots have much more in common with the protein dynamics of interest than the random process does, up to the first 30 modes. (b) The Z-scores for the RMSIP scores shown in panel (a). (c) Comparison of two myosin V (795 residues) CEs run under different simulation conditions and a random process with 2,385 DOF. Again, note the rapid saturation of the RMSIP scores in an essential subspace defined by the first ten modes. (d) The Z-scores for the RMSIP scores in panel (c)
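
The RMSIP and Z-score calculations behind these panels can be sketched as follows; the helper names and the QR-based construction of random orthonormal bases are illustrative assumptions, not the chapter's own code.

import numpy as np

def rmsip(A, B, d):
    """Root mean square inner product between the first d modes
    (columns) of two orthonormal eigenvector matrices A and B."""
    overlaps = A[:, :d].T @ B[:, :d]              # d x d matrix of dot products
    return np.sqrt((overlaps ** 2).sum() / d)

def random_rmsip(n_dof, d, n_trials=50, seed=0):
    """Mean and standard deviation of RMSIP between independent random
    orthonormal bases; the mean approaches sqrt(d / n_dof)."""
    rng = np.random.default_rng(seed)
    scores = [rmsip(np.linalg.qr(rng.normal(size=(n_dof, d)))[0],
                    np.linalg.qr(rng.normal(size=(n_dof, d)))[0], d)
              for _ in range(n_trials)]
    return np.mean(scores), np.std(scores)

def rmsip_zscore(observed, n_dof, d, n_trials=50):
    """Z-score of an observed RMSIP relative to the random-process baseline."""
    mean, sd = random_rmsip(n_dof, d, n_trials)
    return (observed - mean) / sd

# A true self-comparison saturates at 1.0 and gives a large positive Z-score
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(225, 9)))    # stand-in orthonormal modes
print("self RMSIP:", rmsip(Q, Q, 9))              # 1.0 by construction
print("random baseline:", random_rmsip(225, 9))
print("Z-score of 1.0:", rmsip_zscore(1.0, 225, 9))
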
Fig. 3
The Kaiser-Meyer-Olkin MSA for (a) the FRODA and MD CEs each with 2,000 frames, and (b) for the FRODA CEs each with 10,000 frames. The overall KMO score is shown parenthetically in the legend. (c) Relationship between residue RMSD and MSA for MD. (d) Relationship between residue RMSD and MSA for FRODA. (e) Ribbon diagram colored by the MSA scores for MD. (f) Ribbon diagram colored by the MSA scores for FRODA
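
The per-variable MSA and overall KMO score shown here follow from the correlation matrix and its anti-image (partial) correlations; a minimal sketch, assuming the input is a frames-by-variables array (the function name and the synthetic data are illustrative).

import numpy as np

def kmo_msa(data):
    """Kaiser-Meyer-Olkin measure of sampling adequacy.

    data : (n_frames, n_variables) array.
    Returns (per-variable MSA, overall KMO score).
    """
    R = np.corrcoef(data, rowvar=False)           # correlation matrix
    S = np.linalg.pinv(R)                         # (pseudo-)inverse
    d = np.sqrt(np.diag(S))
    P = -S / np.outer(d, d)                       # anti-image (partial) correlations
    np.fill_diagonal(R, 0.0)                      # exclude self-correlations
    np.fill_diagonal(P, 0.0)
    r2, p2 = R ** 2, P ** 2
    msa = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + p2.sum())
    return msa, overall

rng = np.random.default_rng(0)
msa, overall = kmo_msa(rng.normal(size=(2000, 50)))
print("overall KMO: %.2f" % overall)              # near 0.5 for uncorrelated data
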
Fig. 4
(a) Eigenvalue scree plots for the FRODA and MD CEs showing both the correlation explained by each mode and the cumulative correlations (since the PCA was based on the correlation matrix). (b) The conformation RMSD of the MD and FRODA trajectories; each value is with respect to the starting structure (the crystal structure). (c) The residue RMSD for the MD and FRODA trajectories
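
A short sketch of the two RMSD quantities in panels (b) and (c), assuming a trajectory that is already superposed on the reference; array names and shapes are illustrative.

import numpy as np

def conformation_rmsd(coords, reference):
    """Frame-by-frame RMSD to a reference structure (e.g. the crystal
    structure); coords is (n_frames, n_atoms, 3), reference is (n_atoms, 3)."""
    diff = coords - reference
    return np.sqrt((diff ** 2).sum(axis=(1, 2)) / coords.shape[1])

def residue_rmsd(coords):
    """Per-residue fluctuation about the mean structure of the trajectory."""
    diff = coords - coords.mean(axis=0)
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=0))

rng = np.random.default_rng(0)
traj = rng.normal(size=(2000, 151, 3))            # synthetic 151-residue trajectory
print(conformation_rmsd(traj, traj[0])[:3])       # RMSD of first frames vs. frame 0
print(residue_rmsd(traj)[:3])                     # fluctuation of first residues
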
Fig. 5
The correlations between the first ten variables and the top two PCs. Notice how these variables form a tight cluster with small angles between them, indicating that they are correlated on these PCs. The boundary line on the right is an arc of the unit circle, indicating how close the values are to 1
Fig. 6
The eigenvector collectivity (EVC) for the entire set of eigenvectors from both the MD and FRODA PCA. Note that modes are indexed in order of decreasing eigenvalue, so mode index 1 is the top mode. This plot indicates that the collectivity measure should not be of primary concern
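
The collectivity plotted here can be computed per mode with the standard entropy-based definition (kappa approaches 1 for fully collective modes and 1/N for fully localized ones); whether this matches the chapter's exact normalization is an assumption.

import numpy as np

def eigenvector_collectivity(evec, n_atoms):
    """Collectivity kappa = (1/N) * exp(-sum_i alpha_i^2 * ln alpha_i^2),
    where alpha_i^2 is the normalized squared displacement of atom i in the mode."""
    disp2 = (evec.reshape(n_atoms, 3) ** 2).sum(axis=1)
    alpha2 = disp2 / disp2.sum()
    alpha2 = alpha2[alpha2 > 0]                   # avoid log(0)
    return np.exp(-(alpha2 * np.log(alpha2)).sum()) / n_atoms

rng = np.random.default_rng(0)
v = rng.normal(size=225)                          # 75 atoms x 3 coordinates
print("collectivity:", eigenvector_collectivity(v / np.linalg.norm(v), 75))
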
Fig. 7
The RMSD and the top three RMSD modes are compared for (a) the MD and (b) the FRODA PCA
Fig. 8
(a) MD and (b) FRODA displacement vectors are projected onto their respective top two PCs as a scatter plot
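
The scatter data are simply the mean-subtracted frames projected onto the top two eigenvectors; a minimal sketch with synthetic data standing in for the MD and FRODA displacement vectors.

import numpy as np

def project_onto_pcs(centered, evecs, n_pcs=2):
    """Project each frame's displacement vector onto the top PCs,
    giving the scatter-plot coordinates of this figure."""
    return centered @ evecs[:, :n_pcs]

rng = np.random.default_rng(0)
traj = rng.normal(size=(2000, 225))
centered = traj - traj.mean(axis=0)
evals, evecs = np.linalg.eigh(centered.T @ centered / (len(traj) - 1))
evecs = evecs[:, ::-1]                            # descending eigenvalue order
xy = project_onto_pcs(centered, evecs)
print(xy.shape)                                   # (2000, 2)
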
Fig. 9
(a) The cumulative overlap (CO) of each MD eigenvector with the entire set of FRODA eigenvectors defining the subspace of the indicated size. We do not show the reverse comparison; the metric is not symmetric, but it yields similar values. (b) The RMSIP scores for the comparisons of random processes with 453 DOF, two FRODA simulations using the same conditions, and the MD and FRODA simulations. Error bars on the random-process scores indicate plus and minus one standard deviation over 50 iterations. (c) The Z-scores for the RMSIP scores. (d) The PA spectra for the comparisons of the MD and FRODA simulations using the indicated SS DIM
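
The cumulative overlap of a single mode with a subspace of the indicated size can be sketched with the usual definition, the square root of the summed squared inner products; the example data are synthetic.

import numpy as np

def cumulative_overlap(v, W, d):
    """Cumulative overlap of mode v with the subspace spanned by the
    first d columns of W: sqrt(sum_j (v . w_j)^2), between 0 and 1."""
    return np.sqrt(((W[:, :d].T @ v) ** 2).sum())

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(453, 453)))  # stand-in orthonormal basis, 453 DOF
print(cumulative_overlap(Q[:, 0], Q, 10))         # 1.0: the mode lies in its own basis
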
Fig. 10
Cluster separation for the dynamics of four different proteins using different kernels, all applied to the same CE containing trajectories of 2,000 FRODA frames for each of the four proteins. (a) Linear kernel, equivalent to standard PCA. (b) Homogeneous polynomial kernel of degree two, which is sensitive to fourth-order statistics. (c) Gaussian kernel with standard deviation set to 50. (d) Neural net kernel with no offset and a slope parameter set to 10⁻⁴. (e) Mutual information kernel. (f) Subspace comparisons of the four kernels in (b)–(e) using the linear-kernel essential space as the reference. The SS DIM in all cases was five. The primary y-axis shows RMSIP scores while the secondary y-axis shows the principal angle value in degrees
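
The kernels named in the caption, and kernel PCA via a double-centered kernel matrix, can be sketched as below; parameter choices beyond what the caption states are assumptions, and the mutual information kernel is omitted because its form is not specified here.

import numpy as np

def linear_kernel(X):
    return X @ X.T                                # equivalent to standard PCA

def poly2_kernel(X):
    return (X @ X.T) ** 2                         # homogeneous polynomial, degree two

def gaussian_kernel(X, sigma=50.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def neural_net_kernel(X, slope=1e-4):
    return np.tanh(slope * (X @ X.T))             # sigmoid kernel, no offset

def kernel_pca(K, n_components=5):
    """Eigendecomposition of a double-centered kernel matrix."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:n_components]
    return evals[idx], evecs[:, idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                    # synthetic ensemble, 200 frames
for kernel in (linear_kernel, poly2_kernel, gaussian_kernel, neural_net_kernel):
    evals, _ = kernel_pca(kernel(X))
    print(kernel.__name__, evals[:3])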
