On landmark selection and sampling in high-dimensional data analysis

Mohamed-Ali Belabbas et al. Philos Trans A Math Phys Eng Sci. 2009 Nov 13;367(1906):4295-312. doi: 10.1098/rsta.2009.0161.

Abstract

In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here, we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams.
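
As a rough illustration of the procedure the abstract describes, the sketch below reconstructs a kernel matrix from a small set of landmark columns via the Nyström extension. The Gaussian kernel, data sizes and variable names are our own assumptions for this toy example, not the paper's experimental settings.

```python
import numpy as np

def nystrom_reconstruction(K, landmarks):
    """Approximate a PSD kernel matrix K using only the rows and
    columns indexed by `landmarks` (the Nystrom extension)."""
    C = K[:, landmarks]                    # n x k cross block
    W = K[np.ix_(landmarks, landmarks)]    # k x k landmark block
    return C @ np.linalg.pinv(W) @ C.T     # rank-k approximation of K

# Toy usage: Gaussian kernel on random data, 10 uniformly chosen landmarks.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
landmarks = rng.choice(len(X), size=10, replace=False)
K_hat = nystrom_reconstruction(K, landmarks)
print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```

Comparing the reconstruction against the full kernel in this way yields the kind of normalized approximation error reported in figures 4 and 6 below.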


Figures

Figure 1. PCA, with measurements in (a) expressed in (b) in terms of the two-dimensional subspace that best explains their variability. (a) An example set of centred measurements, with projections onto each coordinate axis also shown. (b) PCA yields a plane indicating the directions of greatest variability of the data.
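
For readers who want the linear baseline of figure 1 in concrete form, a minimal PCA projection can be written directly from the eigendecomposition of the sample covariance; the function below is a generic sketch (names and defaults are ours), not the authors' code.

```python
import numpy as np

def pca_embedding(X, d=2):
    """Project centred measurements onto the d directions of greatest variability."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :d]           # leading d principal directions
    return Xc @ top                         # coordinates in the best-fitting subspace
```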
Figure 2. (a) ‘Fishbowl’ data (sphere with top cap removed). Nonlinear dimension reduction, with contrasting embeddings of the data of (a) shown. (b) The two-dimensional linear embedding via PCA yields an overlap of points of different colour, indicating a failure to recover the nonlinear structure of the data. (c and d) The embeddings obtained by diffusion maps and Laplacian eigenmaps, respectively; each of these methods successfully recovers the nonlinear structure of the original dataset, correctly ‘unfolding’ it in two dimensions.
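
A hedged sketch of how data like figure 2's fishbowl might be generated and unfolded: the code below samples a capped sphere and applies a plain (unnormalised) Laplacian eigenmaps embedding on a dense Gaussian affinity graph. The cap height, bandwidth and graph construction are illustrative assumptions, not the settings used for the figure.

```python
import numpy as np

def fishbowl(n=500, cap=0.8, rng=None):
    """Sample points uniformly on the unit sphere and discard the top cap."""
    rng = np.random.default_rng(0) if rng is None else rng
    pts = rng.standard_normal((4 * n, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    return pts[pts[:, 2] < cap][:n]

def laplacian_eigenmaps(X, sigma=0.3, d=2):
    """Embed using the smallest non-trivial eigenvectors of the graph Laplacian."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))      # Gaussian affinity graph
    L = np.diag(W.sum(axis=1)) - W          # unnormalised graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]              # skip the constant eigenvector

Y = laplacian_eigenmaps(fishbowl())         # two-dimensional 'unfolded' coordinates
```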
Figure 3. Diffusion maps embedding of the video fuji.avi from the Honda/UCSD database (Lee et al. 2003, 2005), implemented in the pixel domain after data normalization, σ=100.
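
A compact sketch of a diffusion maps embedding as it might be applied to flattened, normalized video frames with bandwidth σ: form a Gaussian affinity matrix, row-normalise it into a Markov matrix, and take its leading non-trivial eigenvectors scaled by their eigenvalues. The eigensolver choice and normalisation details below are assumptions, not a reproduction of the authors' pipeline.

```python
import numpy as np

def diffusion_map(frames, sigma=100.0, d=2):
    """Diffusion maps embedding of data points such as flattened video frames."""
    sq = np.sum((frames[:, None, :] - frames[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))           # Gaussian affinities, bandwidth sigma
    P = K / K.sum(axis=1, keepdims=True)         # row-normalised Markov matrix
    eigvals, eigvecs = np.linalg.eig(P)          # right eigenvectors of P
    order = np.argsort(-eigvals.real)[1:d + 1]   # skip the trivial eigenvalue 1
    return eigvecs[:, order].real * eigvals[order].real
```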
Figure 4. Average normalized approximation error of the Nyström reconstruction of the diffusion maps kernel obtained from the video of figure 3 using different subset selection methods. Sampling according to the determinant yields overall the best performance. Blue circles, uniform sampling (s=0); red squares, determinantal maximization (s→∞); black diamonds, determinantal sampling (s=1).
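
The parameter s in this caption and in figure 6 indexes a family of subset-selection rules in which a candidate landmark set I is weighted by det(K_I) raised to the power s, so that s=0 recovers uniform sampling and s=1 determinantal sampling. The Metropolis swap chain below is our illustrative way of drawing from such a distribution; it is a sketch under these assumptions, not the selection algorithm analysed in the paper.

```python
import numpy as np

def sample_landmarks(K, k, s=1.0, n_steps=2000, rng=None):
    """Draw a size-k landmark set I with probability roughly proportional
    to det(K_I)**s via a simple Metropolis swap chain (s=0 is uniform)."""
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    I = list(rng.choice(n, size=k, replace=False))
    logdet = lambda idx: np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
    cur = logdet(I)
    for _ in range(n_steps):
        proposal = I.copy()
        swap_out = rng.integers(k)                      # position to replace
        outside = np.setdiff1d(np.arange(n), I)         # points not currently selected
        proposal[swap_out] = int(rng.choice(outside))
        new = logdet(proposal)
        if np.log(rng.random()) < s * (new - cur):      # accept/reject by det ratio ** s
            I, cur = proposal, new
    return np.array(I)
```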
Figure 5. (a) Exact, (b) determinant sampling and (c) uniform sampling diffusion maps embedding of a video showing movement at an uneven speed, implemented in the pixel domain with σ=100 after data normalization. Note that the linear structure of this manifold is recovered almost exactly by the determinantal sampling scheme, whereas it is lost in the case of uniform sampling, where the curve folds over onto itself.
Figure 6. Average normalized approximation error of the Nyström reconstruction of the diffusion maps kernel obtained from the video of figure 5 using different subset selection methods. Sampling uniformly is consistently outperformed by the other methods. Blue circles, uniform sampling (s=0); red squares, determinantal maximization (s→∞); black diamonds, determinantal sampling (s=1).

References

    1. Belabbas M.-A., Wolfe P. J. 2009 Spectral methods in machine learning and new strategies for very large data sets. Proc. Natl Acad. Sci. USA 106, 369–374. (doi:10.1073/pnas.0810600105)
    2. Belkin M., Niyogi P. 2003 Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396. (doi:10.1162/089976603321780317)
    3. Blackburn J., Ribeiro E. 2007 Human motion recognition using isomap and dynamic time warping. In Human motion–understanding, modeling, capture and animation (eds Elgammal A., Rosenhahn B., Klette R.), pp. 285–298. Lecture Notes in Computer Science. Berlin, Germany: Springer.
    4. Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F., Zucker S. W. 2005 Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl Acad. Sci. USA 102, 7426–7431. (doi:10.1073/pnas.0500334102)
    5. Deshpande A., Rademacher L., Vempala S., Wang G. 2006 Matrix approximation and projective clustering via volume sampling. In Proc. 17th Annual ACM-SIAM Symp. on Discrete Algorithms, Miami, FL, pp. 1117–1126. See http://siam.org/meetings/da06/index.htm
