Nonnegative Matrix Factorization for Text, Graph, and Hybrid Data Analytics

Series: Dissertation Defense
Time: Monday, March 12, 2018 - 10:00am for 1 hour (actually 50 minutes)
Location: Klaus 2108
Speaker: Rundong Du – Georgia Tech – rdu@gatech.edu
Organizer: Mohammad Ghomi

Constrained low rank approximation is a general framework for data analysis that is typically simple, fast, scalable, and domain general. One of the best-known constrained low rank approximation methods is nonnegative matrix factorization (NMF). This research studies the design and implementation of several variants of NMF for text, graph, and hybrid data analytics. It addresses challenges that include solving new data analytics problems and improving the scalability of existing NMF algorithms.

There are two major types of matrix representation of data: the feature-data matrix and the similarity matrix. Previous work showed successful application of standard NMF on feature-data matrices to areas such as text mining and image analysis, and of Symmetric NMF (SymNMF) on similarity matrices to areas such as graph clustering and community detection. In this work, a divide-and-conquer strategy is applied to both methods to improve their time complexity from cubic growth with respect to the reduced low rank to linear growth, resulting in the DC-NMF and HierSymNMF2 methods. Extensive experiments on large-scale real-world data show improved performance of these two methods.

Furthermore, in this work NMF and SymNMF are combined into one formulation, called JointNMF, to analyze hybrid data that contains both text content and connection structure information. Typical hybrid data to which JointNMF can be applied include paper/patent data, where there are citation connections among content, and email data, where the sender/recipient relation is represented by a hypergraph and the email content is associated with hypergraph edges. An additional capability of JointNMF is the prediction of unknown network information, which is illustrated using several real-world problems such as citation recommendation for papers and activity/leader detection in organizations.

The dissertation also includes brief discussions of the relationships among different variants of NMF.
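For readers unfamiliar with the standard NMF that the abstract builds on, the following is a minimal sketch of the classical multiplicative-update scheme for approximating a nonnegative feature-data matrix X by a product W H with nonnegative factors. It is not the speaker's DC-NMF, HierSymNMF2, or JointNMF algorithm; the function names (nmf_multiplicative, relative_error), the parameters, and the toy data are all illustrative assumptions.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (m x n) as W @ H with W, H >= 0.

    Uses the classical multiplicative updates for the Frobenius-norm
    objective ||X - W H||_F^2. Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))  # nonnegative random initialization
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Elementwise updates keep W and H nonnegative; eps guards
        # against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def relative_error(X, W, H):
    """Relative Frobenius-norm reconstruction error of W @ H against X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)


# Toy data (assumption): an exactly rank-4 nonnegative 20 x 30 matrix.
data_rng = np.random.default_rng(1)
X = data_rng.random((20, 4)) @ data_rng.random((4, 30))

W, H = nmf_multiplicative(X, k=4)
err = relative_error(X, W, H)
print(f"relative reconstruction error: {err:.4f}")
```

In a text-mining setting, the columns of X would typically be documents and the rows terms, so that columns of W can be read as topics and columns of H as topic weights per document. The reduced rank k is the quantity whose cost the abstract's divide-and-conquer methods aim to bring down from cubic to linear growth.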

Georgia Institute of Technology College of Sciences
