Exploring Conditional Computation in Transformer models
- Series
- Applied and Computational Mathematics Seminar
- Time
- Monday, September 30, 2024 - 14:00 for 1 hour (actually 50 minutes)
- Location
- Skiles 005 and ONLINE
- Speaker
- Xin Wang – Google Research – xinwangmath@gmail.com
The Transformer (Vaswani et al., 2017) is a deep learning architecture that today forms the foundation for most natural language processing tasks and serves as the backbone of current state-of-the-art language models. Central to its success is the attention mechanism, which allows the model to weigh the importance of different input tokens. However, Transformers can become computationally expensive, especially for large-scale tasks. To address this, researchers have explored conditional computation, which selectively activates parts of the model based on the input. In this talk, we present two case studies of conditional computation in Transformer models. In the first, we examine the routing mechanism in Mixture-of-Experts (MoE) Transformer models and show theoretical and empirical evidence that the router's ability to route intelligently confers a significant advantage on MoE models. In the second, we introduce Alternating Updates (AltUp), a method that takes advantage of an increased residual stream width in Transformer models without increasing the computation cost.
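To give a flavor of the routing idea, the sketch below shows a minimal top-k gating layer: a learned router scores each token, only the k highest-scoring experts are activated, and their outputs are combined with normalized gate weights. This is an illustrative toy example (names, sizes, and the use of NumPy are assumptions), not the models or code discussed in the talk.

```python
# Minimal sketch of top-k expert routing (illustrative only).
# Each token activates just k of the n_experts feed-forward blocks,
# so per-token compute stays roughly constant as experts are added.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, k = 16, 4, 2  # illustrative sizes
W_router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(x):
    """x: (n_tokens, d_model) -> (n_tokens, d_model), using only k experts per token."""
    logits = x @ W_router                      # router scores, (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = np.exp(logits[t, top[t]])
        gate /= gate.sum()                     # renormalize gates over the selected experts
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ experts[e])  # only k expert matmuls per token
    return out

tokens = rng.standard_normal((3, d_model))
print(moe_layer(tokens).shape)                 # (3, 16)
```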
About the speaker: Xin Wang is a research engineer on the Algorithms team at Google Research. He completed his PhD in Mathematics at the Georgia Institute of Technology before joining Google. His research interests include efficient computing, memory mechanisms for machine learning, and optimization.
The talk will be presented online at