J R Stat Soc Series B Stat Methodol. 2022 Nov;84(5):2000-2031.
doi: 10.1111/rssb.12550. Epub 2022 Nov 20.

High-dimensional principal component analysis with heterogeneous missingness

Ziwei Zhu et al. J R Stat Soc Series B Stat Methodol. 2022 Nov.

Abstract

We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.

Keywords: heterogeneous missingness; high‐dimensional statistics; iterative projections; missing data; principal component analysis.
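
The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`opw_init`, `impute_rows`, `prime_pca_sketch`, `sin_theta_error`) are hypothetical, the OPW initialiser is written in one plausible form (the zero-filled Gram matrix divided entrywise by the proportion of rows in which both coordinates are observed), and any refinements in the paper, such as restricting attention to reliably imputable rows or a convergence-based stopping rule, are omitted.

```python
import numpy as np

def opw_init(Y, mask, K):
    """Initial estimate from a weighted Gram matrix (assumed form of OPW)."""
    n = Y.shape[0]
    Y0 = np.where(mask, Y, 0.0)            # zero-fill the missing entries
    M = mask.astype(float)
    W = (M.T @ M) / n                      # proportion of rows observing both j and k
    G = (Y0.T @ Y0) / n
    G = np.divide(G, W, out=np.zeros_like(G), where=W > 0)
    _, eigvecs = np.linalg.eigh(G)         # eigenvalues returned in ascending order
    return eigvecs[:, ::-1][:, :K]         # K leading eigenvectors

def impute_rows(Y, mask, V):
    """Fill missing entries of each row using the column space of V."""
    Y_hat = np.where(mask, Y, 0.0)
    for i in range(Y.shape[0]):
        obs = mask[i]
        if obs.all() or not obs.any():
            continue                       # nothing to impute, or nothing to fit on
        # Least-squares coefficients of the observed entries on the rows of V
        u, *_ = np.linalg.lstsq(V[obs], Y[i, obs], rcond=None)
        Y_hat[i, ~obs] = V[~obs] @ u
    return Y_hat

def prime_pca_sketch(Y, mask, K, n_iter=50):
    """Alternate imputation and SVD for a fixed number of iterations."""
    V = opw_init(Y, mask, K)
    for _ in range(n_iter):
        Y_hat = impute_rows(Y, mask, V)
        _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
        V = Vt[:K].T                       # leading right singular space
    return V

def sin_theta_error(V_hat, V):
    """Frobenius norm of the part of V_hat lying outside the span of V."""
    return np.linalg.norm(V_hat - V @ (V.T @ V_hat))
```

Here `Y` is an n-by-d array whose missing entries may hold any placeholder (including NaN), and `mask` is a boolean array of the same shape that is `True` where an entry is observed. The fixed iteration count `n_iter` is an assumption made for simplicity.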

Figures

FIGURE 1
An illustration of the two steps of a single iteration of the primePCA algorithm with $d = 3$ and $K = 1$. Black dots represent fully observed data points, while vertical dotted lines that emanate from them give an indication of their $x_3$ coordinate values, as well as their projections onto the $x_1 x_2$ plane. The $x_1$ coordinate of the orange data point and the $x_2$ coordinate of the blue data point are unobserved, so the true observations lie on the respective solid lines through those points (which are parallel to the relevant axes). Starting from an input estimate of $V_K$ (left), given by the black arrow, we impute the missing coordinates as the closest points on the coloured lines to $V_K$ (middle), and then obtain an updated estimate of $V_K$ as the leading right singular vector of the imputed data matrix (right, with the old estimate in grey). [Colour figure can be viewed at wileyonlinelibrary.com]
FIGURE 2
Estimates of $\mathbb{E} L(\hat{V}_K^{\mathrm{prime}}, V_K)$ for various choices of σ under (H1) in the noiseless setting of Section 4.1 (left) and (H2) in the noisy setting of Section 4.2 with $\nu = 20$ (right)
FIGURE 3
Logarithms of the average Frobenius norm sin Θ error of primePCA and softImpute under various heterogeneity levels of missingness in the absence of noise. The four rows of plots, from top to bottom, correspond to (H1), (H2), (H3) and (H4). [Colour figure can be viewed at wileyonlinelibrary.com]
FIGURE 4
Leading eigenvalues of ^y
FIGURE 5
Plots of the first two principal components $\hat{V}_2^{\mathrm{prime}}$ (left) and the associated scores $\{\hat{u}_i\}_{i=1}^{n}$ (right) [Colour figure can be viewed at wileyonlinelibrary.com]
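
The imputation step illustrated in Figure 1 can be worked through by hand when K = 1. The closest point on the line of candidate observations to the span of the current direction is found by regressing the observed coordinates on the matching coordinates of that direction, then using the fitted coefficient to fill in the missing coordinate. The sketch below assumes this reading of the caption; the function name `impute_row_rank_one` and the use of `None` to mark a missing coordinate are illustrative choices, not taken from the paper.

```python
def impute_row_rank_one(y, v):
    """Impute missing coordinates of y (marked None) along direction v (K = 1)."""
    obs = [j for j, value in enumerate(y) if value is not None]
    # Least-squares coefficient of the observed coordinates on v
    numerator = sum(y[j] * v[j] for j in obs)
    denominator = sum(v[j] ** 2 for j in obs)
    if denominator == 0.0:
        raise ValueError("direction has no mass on the observed coordinates")
    u = numerator / denominator
    # Keep observed values and fill the rest with the fitted coordinate
    return [y[j] if y[j] is not None else v[j] * u for j in range(len(y))]


# A point in d = 3 whose x1 coordinate is unobserved
print(impute_row_rank_one([None, 2.0, 2.0], [1.0, 2.0, 2.0]))  # [1.0, 2.0, 2.0]
```

In this example the observed coordinates already lie exactly along the direction, so the imputed point falls on the line spanned by it. In general the imputed point only agrees with the fitted line in its missing coordinates.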
