Spectrum Reconstruction Technique and Improved Naive Bayes Models for Text Classification Problems

Series
Dissertation Defense
Time
Thursday, April 16, 2020 - 2:00pm for 1 hour (actually 50 minutes)
Location
Bluejeans Meeting 866242745
Speaker
Zhibo Dai – Georgia Tech – zdai37@gatech.edu
Organizer
Zhibo (Roger) Dai

My thesis studies two topics. In the first part, we study the spectrum reconstruction technique. Eigenvalues play an important role in many research fields and are the foundation of many practical techniques such as PCA (Principal Component Analysis). We believe that related algorithms should perform better with more accurate spectrum estimation. An approximation formula was proposed by Prof. Matzinger, but without a proof. In our research, we show why the formula works. When both the number of features and the dimension of the space go to infinity, we find the order of the error of the approximation formula, which depends on a constant C, the ratio of the dimension of the space to the number of features.
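To illustrate why spectrum estimation needs correction at all, the following minimal sketch compares the eigenvalues of a sample covariance matrix with the true spectrum. It is not the approximation formula discussed in the thesis, which the abstract does not state. The names `sample_spectrum`, `n_obs`, and `dim` are hypothetical, the true covariance is assumed to be the identity, and the ratio is assumed to be dimension over number of observations.

```python
import numpy as np

def sample_spectrum(n_obs, dim, seed=0):
    """Eigenvalues of the sample covariance of i.i.d. standard normal data.

    Assumption: the true covariance is the identity, so every true
    eigenvalue equals 1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_obs, dim))  # rows are observations
    S = X.T @ X / n_obs                    # sample covariance (mean known to be 0)
    return np.linalg.eigvalsh(S)           # ascending real eigenvalues

# Ratio dim / n_obs = 0.5 (hypothetical choice for the illustration)
eigs = sample_spectrum(n_obs=400, dim=200)
print("smallest:", eigs.min(), "largest:", eigs.max(), "mean:", eigs.mean())
```

Although every true eigenvalue is 1, the sample eigenvalues spread widely around 1 while their average stays close to 1. This spread is the kind of distortion a spectrum reconstruction method tries to undo.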

In the second part, we focus on applications of Naive Bayes models to text classification problems. In particular, we consider two situations: 1) there is insufficient data for model training; 2) the data are only partially labeled. We choose Naive Bayes as our base model and modify it to achieve better performance in these two situations. To improve model performance and to use as much information as possible, we introduce a correlation factor, which relaxes the conditional independence assumption of Naive Bayes. The new estimates are biased relative to the traditional Naive Bayes estimates but have much smaller variance, which gives better prediction results.
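For reference, the traditional Naive Bayes baseline mentioned above can be sketched as follows. This is a minimal multinomial Naive Bayes with add-one smoothing, not the improved model with the correlation factor, whose form the abstract does not specify. The function names, the smoothing parameter `alpha`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with additive smoothing alpha."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_c in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total)
                   for w in vocab}
        model[label] = (math.log(n_c / len(labels)), log_lik)
    return model

def predict(model, tokens):
    """Return the class with the highest posterior log-score."""
    def score(label):
        log_prior, log_lik = model[label]
        # Conditional independence: per-word log-likelihoods simply add up.
        # Words never seen in training are skipped.
        return log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(model, key=score)

# Toy data, invented for illustration only
docs = [["cheap", "pills", "offer"], ["meeting", "agenda", "notes"],
        ["cheap", "offer", "now"], ["project", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(docs, labels)
print(predict(model, ["cheap", "offer"]))
print(predict(model, ["meeting", "agenda"]))
```

With so few training documents, the per-word estimates are noisy, which is the insufficient-data situation the second part of the thesis addresses.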