Seminar by Kaushik Sinha
Unsupervised Learning in High Dimensions
Kaushik Sinha
University of California, San Diego
Date: Wednesday, July 11th, 2012
Time: 11AM
Venue: CS102
Abstract:
In traditional statistical data analysis, observations are thought of as instances of particular phenomena. In most cases these observations are vectors of values measured on several variables (e.g., blood pressure, weight, height). As a result of tremendous progress in science and technology, and especially in sensing devices, over the last couple of decades, most of the datasets we deal with today are very high dimensional. While such high-dimensional datasets are abundant, labels for individual instances are scarce: we see a plethora of images on the Internet, but only a few of them are tagged or labeled to indicate, say, whether an image shows a cat or a dog. Thus, the vast majority of data analysis tasks involving high-dimensional data are unsupervised, i.e., performed without any label information. In this talk, I will focus on some basic unsupervised learning problems in high dimensions, namely (a) clustering, (b) nearest-neighbor search, (c) sequence/time-series analysis, and (d) dimension reduction. I will briefly describe some of our recent results and ongoing work toward developing provably accurate algorithms for these problems.
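To make the role of dimension reduction concrete, here is a generic, textbook-style sketch in Python (my own illustration, not one of the algorithms from the talk): a random Gaussian projection in the spirit of the Johnson-Lindenstrauss lemma maps n points from dimension d down to a much smaller dimension k while approximately preserving pairwise distances, which is exactly the property that problems like nearest-neighbor search rely on.

    import numpy as np

    # Johnson-Lindenstrauss-style random projection (textbook construction,
    # shown for illustration only; the talk's algorithms are different).
    rng = np.random.default_rng(1)
    n, d, k = 200, 10_000, 500                  # points, original dim, reduced dim

    X = rng.normal(size=(n, d))                 # n points in dimension d
    R = rng.normal(size=(d, k)) / np.sqrt(k)    # scaled random projection matrix
    Y = X @ R                                   # projected points in dimension k

    # Pairwise distances are approximately preserved, so nearest-neighbor
    # queries can be answered in the much cheaper k-dimensional space.
    orig = np.linalg.norm(X[0] - X[1])
    proj = np.linalg.norm(Y[0] - Y[1])
    print(f"original: {orig:.2f}  projected: {proj:.2f}  ratio: {proj / orig:.3f}")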
In the second part of the talk, I will describe the problem of learning high-dimensional Gaussian mixture models in detail. The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. I will talk about our recent result, which resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in the parameters, and they include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that the parameters of any distribution belonging to such a family can be learned in polynomial time using a polynomial number of sample points. The result on learning polynomial families is quite general and of independent interest. To estimate the parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimation problems in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
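As a toy illustration of the "polynomial family" idea, consider the simplest case (my own one-dimensional sketch in Python, not the algorithm of the talk): for an equal-weight mixture of two unit-variance Gaussians N(mu1, 1) and N(mu2, 1), the first two moments are polynomials in the parameters, E[X] = (mu1 + mu2)/2 and E[X^2] = (mu1^2 + mu2^2)/2 + 1, so matching empirical moments reduces parameter estimation to solving a quadratic.

    import numpy as np

    def estimate_means(x):
        """Method-of-moments estimate for an equal-weight mixture of two
        unit-variance Gaussians in one dimension (toy illustration).

        Since E[X] = (mu1 + mu2)/2 and E[X^2] = (mu1^2 + mu2^2)/2 + 1,
        the sum s = mu1 + mu2 and product p = mu1 * mu2 are polynomial in
        the empirical moments, and mu1, mu2 are the roots of t^2 - s*t + p.
        """
        m1 = x.mean()                 # empirical first moment
        m2 = (x ** 2).mean()          # empirical second moment
        s = 2.0 * m1                  # mu1 + mu2
        sq = 2.0 * (m2 - 1.0)         # mu1^2 + mu2^2
        p = (s ** 2 - sq) / 2.0       # mu1 * mu2
        return np.sort(np.roots([1.0, -s, p]))

    # Sanity check on synthetic data with true means -2 and 3.
    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, size=100_000)                 # latent component labels
    x = rng.normal(loc=np.where(z == 0, -2.0, 3.0))      # unit-variance samples
    print(estimate_means(x))                             # approx [-2., 3.]

With more components, unknown variances, or higher dimension, the moment equations remain polynomial in the parameters but are no longer solvable by hand; that is where the real algebraic geometry machinery mentioned above comes in.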