Curr Metabolomics. 2016;4(2):97-103.
doi: 10.2174/2213235X04666160613122429.

PCA as a practical indicator of OPLS-DA model reliability

Bradley Worley et al. Curr Metabolomics. 2016.

Abstract

Background: Principal Component Analysis (PCA) and Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS-DA) are powerful statistical modeling tools that provide insights into separations between experimental groups based on high-dimensional spectral measurements from NMR, MS or other analytical instrumentation. However, when used without validation, these tools may lead investigators to statistically unreliable conclusions. This danger is especially real for Partial Least Squares (PLS) and OPLS, which aggressively force separations between experimental groups. As a result, OPLS-DA is often used as an alternative method when PCA fails to expose group separation, but this practice is highly dangerous. Without rigorous validation, OPLS-DA can easily yield statistically unreliable group separation.

Methods: A Monte Carlo analysis of PCA group separations and OPLS-DA cross-validation metrics was performed on NMR datasets with statistically significant separations in scores-space. A linearly increasing amount of Gaussian noise was added to each data matrix followed by the construction and validation of PCA and OPLS-DA models.
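The noise-addition step can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the synthetic two-group matrix, the helper names `add_noise` and `pca_scores`, and the choice of the Frobenius norm as the matrix norm are all assumptions. The factor of 0.002 per unit of noise is taken from the Fig. 3 caption.

```python
import numpy as np

def add_noise(X, level, rng, unit=0.002):
    """Return X plus Gaussian noise with sd = level * unit * ||X||.

    The norm is the Frobenius norm here (an assumption; the abstract
    does not say which matrix norm was used).
    """
    sigma = level * unit * np.linalg.norm(X)
    return X + rng.normal(0.0, sigma, size=X.shape)

def pca_scores(X, n_components=2):
    """PCA scores from the SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)

# Hypothetical stand-in for a spectral data matrix:
# 2 groups x 10 samples, 50 variables, groups differ in 5 variables.
X = rng.normal(size=(20, 50))
X[10:, :5] += 3.0

# Linearly increasing noise levels, one PCA model per noisy matrix.
for level in range(0, 21, 4):
    scores = pca_scores(add_noise(X, level, rng))
    gap = abs(scores[:10, 0].mean() - scores[10:, 0].mean())
    print(f"{level:>2}X noise: PC1 gap between group means = {gap:.2f}")
```

In a full Monte Carlo run, each noise level would be repeated many times with fresh noise draws, and a supervised model would be fitted and cross-validated alongside the PCA model.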

Results: With increasing added noise, the PCA scores-space distance between groups rapidly decreased and the OPLS-DA cross-validation statistics simultaneously deteriorated. The correlation between the loadings estimated from noisy data and the true loadings from the original data also decreased. While the validity of the OPLS-DA model diminished with increasing added noise, the group separation in OPLS-DA scores-space remained largely unaffected.
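The two quantities tracked in these results, group separation in PCA scores-space and agreement between noisy and original loadings, can be illustrated roughly as follows. The sketch makes several assumptions: the data are synthetic, the Mahalanobis distance uses a pooled within-class covariance (the abstract does not give the exact formulation), and the first PLS weight vector is swapped in for the OPLS-DA predictive loading because OPLS-DA is not available in NumPy. All function names are hypothetical.

```python
import numpy as np

def mahalanobis_between_groups(scores, labels):
    """Mahalanobis distance between two class means (pooled covariance)."""
    a, b = scores[labels == 0], scores[labels == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    pooled = ((len(a) - 1) * np.cov(a, rowvar=False)
              + (len(b) - 1) * np.cov(b, rowvar=False)) / (len(a) + len(b) - 2)
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

def first_pls_weight(X, y):
    """Unit-norm first PLS weight vector, proportional to Xc^T yc.

    Used only as a stand-in for an OPLS-DA predictive loading.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)

def loading_correlation(p, p0):
    """Pearson correlation between two loading vectors."""
    return float(np.corrcoef(p, p0)[0, 1])

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
y = labels.astype(float)

# Hypothetical data matrix: two groups differing in 5 of 50 variables.
X = rng.normal(size=(20, 50))
X[labels == 1, :5] += 3.0
p0 = first_pls_weight(X, y)  # "true" loading from the original matrix

sigma_unit = 0.002 * np.linalg.norm(X)  # Frobenius norm, an assumption
for level in (0, 4, 20):
    noisy = X + rng.normal(0.0, level * sigma_unit, size=X.shape)
    Xc = noisy - noisy.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * s[:2]  # first two PCA scores
    d = mahalanobis_between_groups(scores, labels)
    r = loading_correlation(first_pls_weight(noisy, y), p0)
    print(f"{level:>2}X noise: D_M = {d:.2f}, r(p, p0) = {r:.3f}")
```

A single noise draw per level is shown for brevity. The study itself reports distributions over many Monte Carlo draws and pairs them with CV-ANOVA p values, which this sketch does not compute.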

Conclusion: Supported by the results of Monte Carlo analyses of PCA group separations and OPLS-DA cross-validation metrics, we provide practical guidelines and cross-validatory recommendations for reliable inference from PCA and OPLS-DA models.

Keywords: Chemometrics; Metabolomics; OPLS; PCA; PLS.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Relationship between the Mahalanobis distance (D_M) between classes in PCA scores-space and OPLS-DA CV-ANOVA p values, obtained through Monte Carlo simulation. Panels (A) and (B) hold results computed from the Coffees and Media datasets, respectively. The density of points in both panels is indicated by coloring, where red indicates high point density and blue indicates low density.
Fig. 2
Relationship between OPLS-DA CV-ANOVA p values and the correlation between OPLS-DA predictive loadings obtained from noisy data (p) and loadings obtained from the original data matrix (p0), computed through Monte Carlo simulation. Panels (A) and (B) hold results computed from the Coffees and Media datasets, respectively. The density of points in both panels is indicated by coloring, where red indicates high point density and blue indicates low density.
Fig. 3
(A) Decrease of correlation between estimated loadings (p) and true loadings (p0) as varying degrees of noise are added to the Coffees (red) and Media (blue) data matrices. Light shaded regions indicate confidence intervals of plus or minus one standard deviation from the mean correlation. A value of 1X additive noise corresponds to a noise standard deviation equaling 0.002 times the data matrix ℓ2 norm. (B) Increase of p values from CV-ANOVA OPLS-DA validation as varying degrees of noise are added to the Coffees (red) and Media (blue) data matrices. Light shaded regions indicate confidence intervals of plus or minus one standard deviation from the median p value.
Fig. 4
Comparison of representative PCA (A, C, E) and OPLS-DA (B, D, F) scores resulting from modeling the original Coffees data matrix (A, B), the 4X noisy data matrix (C, D) and the 20X noisy data matrix (E, F). Class ellipses represent the 95% confidence regions for class membership. CV-ANOVA p values for the OPLS-DA models generated from the original, 4X noisy and 20X noisy data matrices are 2.82 × 10^−11, 2.99 × 10^−4, and 1.97 × 10^−1, respectively.
