Text-classification methods and the mathematical theory of Principal Components

Series
Dissertation Defense
Time
Monday, April 8, 2019 - 12:10pm for 80 minutes
Location
Skiles 202
Speaker
Jiangning Chen – Georgia Institute of Technology – jchen444@gatech.edu
Organizer
Jiangning Chen

We are going to talk about three topics. First, we consider Principal Component Analysis (PCA) as a dimension-reduction technique and investigate how useful it is for real-life problems. The difficulty is that the spectrum of the covariance matrix is often estimated incorrectly because the ratio of the sample size to the feature-space dimension is not large enough. We show how to reconstruct the spectrum of the ground-truth covariance matrix from the spectrum of the estimated covariance matrix for multivariate normal vectors. We then present an algorithm for reconstructing the spectrum in the case of sparse matrices arising in text classification.
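For intuition, here is a minimal NumPy sketch (not the reconstruction algorithm from the talk) of the underlying phenomenon: when the sample size n is not much larger than the dimension p, the eigenvalues of the sample covariance spread out around the true spectrum, here chosen to be the identity. The values of n and p are illustrative.

```python
import numpy as np

# When n/p is small, the sample-covariance spectrum spreads out
# even though every ground-truth eigenvalue equals 1.
rng = np.random.default_rng(0)

n, p = 200, 100                      # sample size vs. feature dimension
X = rng.standard_normal((n, p))      # rows ~ N(0, I_p)

S = (X.T @ X) / n                    # sample covariance matrix
sample_eigs = np.linalg.eigvalsh(S)

print("true spectrum:    all eigenvalues = 1")
print(f"sample spectrum:  min = {sample_eigs.min():.3f}, "
      f"max = {sample_eigs.max():.3f}")
# Marchenko-Pastur predicts support near [(1-sqrt(p/n))^2, (1+sqrt(p/n))^2]
c = p / n
print(f"MP support:       [{(1-np.sqrt(c))**2:.3f}, {(1+np.sqrt(c))**2:.3f}]")
```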

In the second part, we concentrate on schemes for PCA estimation. For the problem of finding the smallest eigenvalue and corresponding eigenvector of the ground-truth covariance matrix, a famous classical estimator is due to Krasulina. We prove convergence of Krasulina's scheme for the smallest eigenvalue and its eigenvector, and then establish the convergence rate.
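Below is a small NumPy sketch of the classical Krasulina iteration, written for the top eigenpair of a synthetic covariance matrix; the step size gamma = 1/t and the synthetic data are illustrative assumptions, and the talk's adaptation to the smallest eigenpair is not reproduced here.

```python
import numpy as np

# Sketch of Krasulina's stochastic iteration for the *largest* eigenpair.
rng = np.random.default_rng(1)

p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                  # a ground-truth covariance matrix
L = np.linalg.cholesky(Sigma)

w = rng.standard_normal(p)           # random initial direction
for t in range(1, 20001):
    x = L @ rng.standard_normal(p)   # stream one sample X_t ~ N(0, Sigma)
    gamma = 1.0 / t                  # step size ~ 1/t
    xw = x @ w
    # Krasulina update: step along the component of x orthogonal to w,
    # using the empirical Rayleigh quotient as the centering term.
    w = w + gamma * (xw * x - (xw**2 / (w @ w)) * w)

w /= np.linalg.norm(w)
print("estimated top eigenvalue :", w @ Sigma @ w)
print("true top eigenvalue      :", np.linalg.eigvalsh(Sigma)[-1])
```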

In the last part, we consider the application problem of text classification from the supervised point of view, using the traditional Naive Bayes method. We derive a modified Naive Bayes method with a new loss function, which gives up the unbiasedness of the traditional Naive Bayes estimator but achieves a smaller variance.
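The talk's specific loss function is not reproduced here; as a familiar analogue of the same bias-variance trade-off, the following NumPy sketch fits a multinomial Naive Bayes classifier with additive (Laplace) smoothing, where alpha > 0 biases the word-probability estimates away from their maximum-likelihood values but reduces their variance. The function names and toy corpus are hypothetical.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """X: (n_docs, n_words) count matrix; y: class labels in {0, ..., K-1}."""
    classes = np.unique(y)
    log_prior = np.log(np.bincount(y) / len(y))
    log_lik = np.empty((len(classes), X.shape[1]))
    for k in classes:
        counts = X[y == k].sum(axis=0) + alpha      # smoothed word counts
        log_lik[k] = np.log(counts / counts.sum())  # log P(word | class)
    return log_prior, log_lik

def predict(X, log_prior, log_lik):
    # argmax over classes of log P(class) + sum of word log-likelihoods
    return np.argmax(X @ log_lik.T + log_prior, axis=1)

# Tiny toy corpus: 2 classes, 4 vocabulary words.
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3], [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])
print(predict(X, *fit_naive_bayes(X, y)))   # -> [0 0 1 1]
```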

Committee: Heinrich Matzinger (advisor); Karim Lounici (advisor); Ionel Popescu (School of Mathematics); Federico Bonetto (School of Mathematics); Xiaoming Huo (School of ISyE)