- Series
- High-Dimensional Phenomena in Statistics and Machine Learning Seminar
- Time
- Friday, July 6, 2012 - 3:05pm for 1.5 hours
- Location
- Skiles 005
- Speaker
- Mahdi Soltanolkotabi – Stanford University
- Organizer
- Karim Lounici
One of the most fundamental steps in data analysis and dimensionality
reduction consists of approximating a given dataset by a single
low-dimensional subspace, which is classically achieved via Principal
Component Analysis (PCA). However, in many applications, the data often lie
near a union of low-dimensional subspaces, reflecting the multiple
categories or classes a set of observations may belong to. In this talk we
discuss the problem of clustering a collection of unlabeled data points
assumed to lie near a union of lower-dimensional planes. Simply stated, the
task is to assign each data point to a cluster so as to recover all the
hidden subspaces. As is common in computer vision or unsupervised learning
applications, we do not know in advance how many subspaces there are, nor do
we have any information about their dimensions. We present a novel geometric
analysis of an algorithm named sparse subspace clustering (SSC), which
significantly broadens the range of problems where it is provably effective.
For instance, we show that SSC can recover multiple subspaces, each of
dimension comparable to the ambient dimension. We also show that SSC can
correctly cluster data points even when the subspaces of interest intersect.
Further, we develop an extension of SSC that succeeds when the data set is
corrupted with possibly overwhelmingly many outliers. Underlying our
analysis are clear geometric insights, which may bear on other sparse
recovery problems. We will also demonstrate the effectiveness of these
methods through various numerical studies.
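The self-expressiveness idea at the heart of SSC can be sketched in a few lines: write each data point as a sparse linear combination of the other points (via an l1-penalized regression), build an affinity graph from the resulting coefficients, and read the clusters off that graph. The following is a minimal toy sketch, not the speaker's implementation: the two-line dataset, the simple ISTA lasso solver, and the connected-components step (in place of full spectral clustering) are all illustrative choices.

```python
import numpy as np

# Toy data: 8 points on two (orthogonal) 1-D subspaces of R^3 -- an
# illustrative dataset, not one from the talk.
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
X = np.column_stack([t * u1 for t in (1.0, -2.0, 3.0, 0.5)] +
                    [t * u2 for t in (2.0, -1.0, 1.5, -3.0)])
n = X.shape[1]

def ista_lasso(A, b, lam=0.01, iters=2000):
    """Solve min_c 0.5*||A c - b||^2 + lam*||c||_1 by ISTA."""
    L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    c = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ c - b)                              # gradient step
        c = c - g / L
        c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)  # shrinkage
    return c

# Self-expressiveness: represent each point using the *other* points only.
C = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    C[others, i] = ista_lasso(X[:, others], X[:, i])

# Symmetrized affinity; clusters = connected components of the sparse graph.
W = np.abs(C) + np.abs(C).T
adj = W > 1e-3
labels = -np.ones(n, dtype=int)
k = 0
for s in range(n):
    if labels[s] >= 0:
        continue
    stack, labels[s] = [s], k       # breadth/depth-first flood fill
    while stack:
        v = stack.pop()
        for w in range(n):
            if adj[v, w] and labels[w] < 0:
                labels[w] = k
                stack.append(w)
    k += 1

print(labels)
```

Because each point on a line is an exact sparse combination of its line-mates, the l1 solution places its weight on same-subspace neighbors, so the affinity graph splits into one connected component per subspace and the two lines are recovered as two clusters.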