On landmark selection and sampling in high-dimensional data analysis

Mohamed-Ali Belabbas et al. Philos Trans A Math Phys Eng Sci. 2009 Nov 13;367(1906):4295-312. doi: 10.1098/rsta.2009.0161.

Abstract

In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here, we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams.
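
As a rough illustration of the procedure the abstract describes, the sketch below reconstructs a kernel matrix from a small set of landmark columns via the Nyström extension. The Gaussian kernel, data sizes and variable names are our own assumptions for this toy example, not the paper's experimental settings.

```python
import numpy as np

def nystrom_reconstruction(K, landmarks):
    """Approximate a PSD kernel matrix K using only the rows and
    columns indexed by `landmarks` (the Nystrom extension)."""
    C = K[:, landmarks]                    # n x k cross block
    W = K[np.ix_(landmarks, landmarks)]    # k x k landmark block
    return C @ np.linalg.pinv(W) @ C.T     # rank-k approximation of K

# Toy usage: Gaussian kernel on random data, 10 uniformly chosen landmarks.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
landmarks = rng.choice(len(X), size=10, replace=False)
K_hat = nystrom_reconstruction(K, landmarks)
print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```

Comparing the reconstruction against the full kernel in this way yields the kind of normalized approximation error reported in figures 4 and 6 below.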


Figures

Figure 1. PCA, with measurements in (a) expressed in (b) in terms of the two-dimensional subspace that best explains their variability. (a) An example set of centred measurements, with projections onto each coordinate axis also shown. (b) PCA yields a plane indicating the directions of greatest variability of the data.
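
For readers who want the linear baseline of figure 1 in concrete form, a minimal PCA projection can be written directly from the eigendecomposition of the sample covariance; the function below is a generic sketch (names and defaults are ours), not the authors' code.

```python
import numpy as np

def pca_embedding(X, d=2):
    """Project centred measurements onto the d directions of greatest variability."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :d]           # leading d principal directions
    return Xc @ top                         # coordinates in the best-fitting subspace
```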
Figure 2. (a) ‘Fishbowl’ data (sphere with top cap removed). Nonlinear dimension reduction, with contrasting embeddings of the data of (a) shown. (b) The two-dimensional linear embedding via PCA yields an overlap of points of different colour, indicating a failure to recover the nonlinear structure of the data. (c and d) The embeddings obtained by diffusion maps and Laplacian eigenmaps, respectively; each of these methods successfully recovers the nonlinear structure of the original dataset, correctly ‘unfolding’ it in two dimensions.
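
A hedged sketch of how data like figure 2's fishbowl might be generated and unfolded: the code below samples a capped sphere and applies a plain (unnormalised) Laplacian eigenmaps embedding on a dense Gaussian affinity graph. The cap height, bandwidth and graph construction are illustrative assumptions, not the settings used for the figure.

```python
import numpy as np

def fishbowl(n=500, cap=0.8, rng=None):
    """Sample points uniformly on the unit sphere and discard the top cap."""
    rng = np.random.default_rng(0) if rng is None else rng
    pts = rng.standard_normal((4 * n, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    return pts[pts[:, 2] < cap][:n]

def laplacian_eigenmaps(X, sigma=0.3, d=2):
    """Embed using the smallest non-trivial eigenvectors of the graph Laplacian."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))      # Gaussian affinity graph
    L = np.diag(W.sum(axis=1)) - W          # unnormalised graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]              # skip the constant eigenvector

Y = laplacian_eigenmaps(fishbowl())         # two-dimensional 'unfolded' coordinates
```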
Figure 3. Diffusion maps embedding of the video fuji.avi from the Honda/UCSD database (Lee et al. 2003, 2005), implemented in the pixel domain after data normalization, σ=100.
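
A compact sketch of a diffusion maps embedding as it might be applied to flattened, normalized video frames with bandwidth σ: form a Gaussian affinity matrix, row-normalise it into a Markov matrix, and take its leading non-trivial eigenvectors scaled by their eigenvalues. The eigensolver choice and normalisation details below are assumptions, not a reproduction of the authors' pipeline.

```python
import numpy as np

def diffusion_map(frames, sigma=100.0, d=2):
    """Diffusion maps embedding of data points such as flattened video frames."""
    sq = np.sum((frames[:, None, :] - frames[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))           # Gaussian affinities, bandwidth sigma
    P = K / K.sum(axis=1, keepdims=True)         # row-normalised Markov matrix
    eigvals, eigvecs = np.linalg.eig(P)          # right eigenvectors of P
    order = np.argsort(-eigvals.real)[1:d + 1]   # skip the trivial eigenvalue 1
    return eigvecs[:, order].real * eigvals[order].real
```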
Figure 4. Average normalized approximation error of the Nyström reconstruction of the diffusion maps kernel obtained from the video of figure 3 using different subset selection methods. Sampling according to the determinant yields overall the best performance. Blue circles, uniform sampling (s=0); red squares, determinantal maximization (s→∞); black diamonds, determinantal sampling (s=1).
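
The parameter s in this caption and in figure 6 indexes a family of subset-selection rules in which a candidate landmark set I is weighted by det(K_I) raised to the power s, so that s=0 recovers uniform sampling and s=1 determinantal sampling. The Metropolis swap chain below is our illustrative way of drawing from such a distribution; it is a sketch under these assumptions, not the selection algorithm analysed in the paper.

```python
import numpy as np

def sample_landmarks(K, k, s=1.0, n_steps=2000, rng=None):
    """Draw a size-k landmark set I with probability roughly proportional
    to det(K_I)**s via a simple Metropolis swap chain (s=0 is uniform)."""
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    I = list(rng.choice(n, size=k, replace=False))
    logdet = lambda idx: np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
    cur = logdet(I)
    for _ in range(n_steps):
        proposal = I.copy()
        swap_out = rng.integers(k)                      # position to replace
        outside = np.setdiff1d(np.arange(n), I)         # points not currently selected
        proposal[swap_out] = int(rng.choice(outside))
        new = logdet(proposal)
        if np.log(rng.random()) < s * (new - cur):      # accept/reject by det ratio ** s
            I, cur = proposal, new
    return np.array(I)
```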
Figure 5. (a) Exact, (b) determinant sampling and (c) uniform sampling diffusion maps embedding of a video showing movement at an uneven speed, implemented in the pixel domain with σ=100 after data normalization. Note that the linear structure of this manifold is recovered almost exactly by the determinantal sampling scheme, whereas it is lost in the case of uniform sampling, where the curve folds over onto itself.
Figure 6. Average normalized approximation error of the Nyström reconstruction of the diffusion maps kernel obtained from the video of figure 5 using different subset selection methods. Sampling uniformly is consistently outperformed by the other methods. Blue circles, uniform sampling (s=0); red squares, determinantal maximization (s→∞); black diamonds, determinantal sampling (s=1).

References

    1. Belabbas M.-A., Wolfe P. J. 2009 Spectral methods in machine learning and new strategies for very large data sets. Proc. Natl Acad. Sci. USA 106, 369–374. (doi:10.1073/pnas.0810600105)
    2. Belkin M., Niyogi P. 2003 Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396. (doi:10.1162/089976603321780317)
    3. Blackburn J., Ribeiro E. 2007 Human motion recognition using isomap and dynamic time warping. In Human motion–understanding, modeling, capture and animation (eds Elgammal A., Rosenhahn B., Klette R.), pp. 285–298. Lecture Notes in Computer Science. Berlin, Germany: Springer.
    4. Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F., Zucker S. W. 2005 Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl Acad. Sci. USA 102, 7426–7431. (doi:10.1073/pnas.0500334102)
    5. Deshpande A., Rademacher L., Vempala S., Wang G. 2006 Matrix approximation and projective clustering via volume sampling. In Proc. 17th Annual ACM-SIAM Symp. on Discrete Algorithms, Miami, FL, pp. 1117–1126. See http://siam.org/meetings/da06/index.htm
