Why are the logits of trained models distorted? A theory of overfitting for imbalanced classification

Series
Stochastics Seminar
Time
Thursday, March 6, 2025 - 3:30pm for 1 hour (actually 50 minutes)
Location
Skiles 006
Speaker
Yiqiao Zhong – University of Wisconsin–Madison – yiqiao.zhong@wisc.edu – https://pages.stat.wisc.edu/~zhong35/index.html
Organizer
Mayya Zhilova

Data imbalance is a fundamental challenge in data analysis, where minority classes account for a small fraction of the training data compared to majority classes. Many existing techniques attempt to compensate for the underrepresentation of minority classes, which are often critical in applications such as rare disease detection and anomaly detection. Notably, in empirical deep learning, large model sizes exacerbate the issue. Yet despite extensive empirical heuristics, the statistical foundations of these methods remain underdeveloped, which undermines the reliability of these machine learning models.

In this talk, I will examine imbalanced classification problems in high dimensions, focusing on support vector machines (SVMs) and logistic regression. I will introduce a "truncation" phenomenon---which we verified across single-cell tabular data, image data, and text data---where overfitting in high dimensions distorts the distribution of logits on training data. I will provide a theoretical foundation by characterizing the asymptotic distribution via a variational formulation. This analysis formalizes the intuition that overfitting disproportionately harms minority classes and reveals how margin rebalancing---a widely used deep learning heuristic---mitigates data imbalance. As a consequence, the theory offers both qualitative and quantitative insights into generalization errors and uncertainty measures such as calibration.
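For readers who want a concrete feel for the setting, the following is a minimal, self-contained NumPy sketch (not taken from the talk or the paper): it fits plain and margin-rebalanced logistic regression on synthetic imbalanced, high-dimensional data and prints the training-logit quantiles per class. The data-generating model, the parameter values, and the specific class-dependent margin offset are illustrative assumptions, chosen only to mimic the overparameterized, imbalanced regime the talk studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic high-dimensional, imbalanced two-class Gaussian data
# (p > n, so the training set is generically linearly separable;
# all sizes and the signal strength are illustrative choices).
n_major, n_minor, p = 450, 50, 600
mu = np.full(p, 0.15)
X = np.vstack([rng.normal(-mu, 1.0, size=(n_major, p)),
               rng.normal(+mu, 1.0, size=(n_minor, p))])
y = np.concatenate([-np.ones(n_major), np.ones(n_minor)])  # +1 = minority

def fit_logistic(X, y, deltas=(0.0, 0.0), lr=0.5, steps=4000):
    """Gradient descent on a logistic loss with class-dependent margin targets:
    loss_i = log(1 + exp(-(y_i * w.x_i - delta_{y_i}))).
    deltas = (offset for y=-1, offset for y=+1); (0, 0) recovers vanilla
    logistic regression, while a larger minority offset asks for a larger
    minority margin (a simple stand-in for margin-rebalancing heuristics)."""
    w = np.zeros(X.shape[1])
    delta = np.where(y > 0, deltas[1], deltas[0])
    for _ in range(steps):
        m = y * (X @ w) - delta                            # shifted margins
        g = -(X.T @ (y / (1.0 + np.exp(m)))) / len(y)      # mean-loss gradient
        w -= lr * g
    return w

def summarize(name, w):
    """Print 5%/50%/95% quantiles of the training logits for each class."""
    logits = X @ w
    for label, cls in [(-1.0, "majority"), (1.0, "minority")]:
        q = np.percentile(logits[y == label], [5, 50, 95])
        print(f"{name:>10s} | {cls:8s} training logits 5/50/95%: "
              f"{q[0]:7.2f} {q[1]:7.2f} {q[2]:7.2f}")

# Vanilla fit vs. a fit that enforces a larger margin for the minority class.
summarize("vanilla", fit_logistic(X, y))
summarize("rebalanced", fit_logistic(X, y, deltas=(0.0, 3.0)))
```

Comparing the two printouts gives a rough, qualitative picture of how training logits behave under overfitting and how a class-dependent margin shifts them; the precise distributional characterization is what the talk's variational analysis provides.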

This talk is based on joint work with Jingyang Lyu (3rd-year Statistics PhD student) and Kangjie Zhou (Columbia Statistics): arXiv:2502.11323.