Seminars and Colloquia by Series

Exploring Conditional Computation in Transformer models

Series
Applied and Computational Mathematics Seminar
Time
Monday, September 30, 2024 - 14:00 for 1 hour (actually 50 minutes)
Location
Skiles 005 and ONLINE
Speaker
Xin WangGoogle Research

Transformer (Vaswani et al. 2017) architecture is a popular deep learning architecture that today comprises the foundation for most tasks in natural language processing and forms the backbone of all the current state-of-the-art language models. Central to its success is the attention mechanism, which allows the model to weigh the importance of different input tokens. However, Transformers can become computationally expensive, especially for large-scale tasks. To address this, researchers have explored techniques for conditional computation, which selectively activate parts of the model based on the input. In this talk, we present two case studies of conditional computation in Transformer models. In the first case, we examine the routing mechanism in the Mixture-of-Expert (MoE) Transformer models, and show theoretical and empirical evidence that the router’s ability to route intelligently confers a significant advantage to MoE models. In the second case, we introduce Alternating Updates (AltUp), a method to take advantage of increased residual stream width in the Transformer models without increasing the computation cost.

 

Speaker's brief introduction: Xin Wang is a research engineer in the Algorithms team at Google Research. Xin finished his PhD in Mathematics at Georgia Institute of Technology before coming to Google. Xin's research interests include efficient computing, memory mechanism for machine learning, and optimization.

The talk will be presented online at

 https://gatech.zoom.us/j/93087689904

Existence of optimal flat ribbons

Series
Geometry Topology Seminar
Time
Monday, September 30, 2024 - 14:00 for 1 hour (actually 50 minutes)
Location
Skiles 006
Speaker
Matteo RaffaelliGeorgia Tech

We revisit the classical problem of constructing a developable surface along a given Frenet curve $\gamma$ in space. First, we generalize a well-known formula, introduced in the literature by Sadowsky in 1930, for the Willmore energy of the rectifying developable of $\gamma$ to any (infinitely narrow) flat ribbon along the same curve. Then we apply the direct method of the calculus of variations to show the existence of a flat ribbon along $\gamma$ having minimal bending energy. Joint work with Simon Blatt.

Shimura operators and interpolation polynomials

Series
Algebra Seminar
Time
Monday, September 30, 2024 - 13:00 for 1 hour (actually 50 minutes)
Location
Skiles 005
Speaker
Songhao ZhuGeorgia Tech

Please Note: This talk starts at 1pm rather than the usual time.

The late Goro Shimura proposed a question regarding certain invariant differential operators on a Hermitian symmetric space. This was answered by Sahi and Zhang by showing that the Harish-Chandra images of these namesake operators are specializations of Okounkov's BC-symmetric interpolation polynomials. We prove, in the super setting, that the Harish-Chandra images of super Shimura operators are specializations of certain BC-supersymmetric interpolation polynomials due to Sergeev and Veselov. Similar questions include the Capelli eigenvalue problems which are generalized to the quantum and/or super settings. This talk is based a joint work with Siddhartha Sahi.

Efficient and optimal high-dimensional planar assignments

Series
Combinatorics Seminar
Time
Friday, September 27, 2024 - 15:15 for
Location
Skiles 005
Speaker
Michael SimkinMassachusetts Institute of Technology

The ($2$-dimensional) assignment problem is to find, in an edge weighted bipartite graph, an assignment (i.e., a perfect matching) of minimum total weight. Efficient algorithms for this problem have been known since the advent of modern algorithmic analysis. Moreover, if the edge weights are i.i.d. Exp(1) random variables and the host graph is complete bipartite, seminal results of Aldous state that the expected weight of the optimal assignment tends to $\zeta(2)$.

 

We consider high-dimensional versions of the random assignment problem. Here, we are given a cost array $M$, indexed by $[n]^k$, and with i.i.d. Exp(1) entries. The objective is to find a ${0,1}$-matrix A that minimizes $\sum_{x \in [n]^k} A_xM_x$, subject to the constraint that every axis-parallel line in A contains exactly one 1. This is the planar assignment problem, and when $k=2$ is equivalent to the usual random assignment problem. We prove that the expected cost of an optimal assignment is $\Theta(n^{k-2})$. Moreover, we describe a randomized algorithm that finds such an assignment with high probability. The main tool is iterative absorption, as developed by Glock, Kühn, Lo, and Osthus. The results answer questions of Frieze and Sorkin. The algorithmic result is in contrast to the axial assignment problem (in which A contains exactly one 1 in each axis-parallel co-dimension 1 hyperplane). For the latter, the best known bounds (which are due to Frankston, Kahn, Narayanan, and Park) exploit the connection between ``spread'' distributions and optimal assignments. Due to this reliance, no efficient algorithm is known.

 

Joint work with Ashwin Sah and Mehtaab Sawhney.

Near-optimal estimation on the union of shift-invariant subspaces

Series
Stochastics Seminar
Time
Thursday, September 26, 2024 - 15:30 for 1 hour (actually 50 minutes)
Location
Skiles 006
Speaker
Dmitrii OstrovskiiGeorgia Tech

In the 1990s, Arkadi Nemirovski asked the following question:

How hard is it to estimate a solution to unknown homogeneous linear difference equation with constant coefficients of order S, observed in the Gaussian noise on [0,N]?

The class of all such solutions, or "signals," is parametric---described by 2S complex parameters---but extremely rich: it includes the weighted sums of S exponentials, polynomials of degree S, harmonic oscillations with S arbitrary frequencies, and their algebraic combinations. Geometrically, this class is the union of all S-dimensional shift-invariant subspaces of the space of two-sided sequences, and of interest is the minimax risk on it with respect to the mean-squared error on [0,N]. I will present a recent result that shows this minimax risk to be O( S log(N) log(S)^2 ), improving over the state of the art by a polynomial in S factor, and coming within an O( log(S)^2 ) factor from the lower bound. It relies upon an approximation-theoretic construction related to minimal-norm interpolation over shift-invariant subspaces, in the spirit of the Landau-Kolmogorov problem, that I shall present in some detail. Namely, we will see that any shift-invariant subspace admits a bounded-support reproducing kernel whose spectrum has nearly the smallest possible Lp-energies for all p ≥ 1 at once.

An ergodic theorem in the Gaussian integer setting

Series
Analysis Seminar
Time
Wednesday, September 25, 2024 - 14:00 for 1 hour (actually 50 minutes)
Location
Skiles 005
Speaker

<<>>

We discuss the Pointwise Ergodic Theorem for the Gaussian divisor function $d(n)$, that is, for a measure preserving $\mathbb Z [i]$ action $T$, the ergodic averages weighted by the divisor function converge pointwise for all functions in $L^p$, for $p>1$.  We obtain improving and sparse bounds for these averages.

Existence of stationary measures for partially damped SDEs with generic, Euler-type nonlinearities

Series
PDE Seminar
Time
Tuesday, September 24, 2024 - 15:30 for 1 hour (actually 50 minutes)
Location
Skiles 006
Speaker
Keagan CallisGeorgia Tech

We study nonlinear energy transfer and the existence of stationary measures in a class of degenerately forced SDEs on R^d with a quadratic, conservative nonlinearity B(x, x) constrained to possess various properties common to finite-dimensional fluid models and a linear damping term −Ax that acts only on a proper subset of phase space in the sense that dim(kerA) ≫ 1. Existence of a stationary measure is straightforward if kerA = {0}, but when the kernel of A is nontrivial a stationary measure can exist only if the nonlinearity transfers enough energy from the undamped modes to the damped modes. We develop a set of sufficient dynamical conditions on B that guarantees the existence of a stationary measure and prove that they hold “generically” within our constraint class of nonlinearities provided that dim(kerA) < 2d/3 and the stochastic forcing acts directly on at least two degrees of freedom. We also show that the restriction dim(kerA) < 2d/3 can be removed if one allows the nonlinearity to change by a small amount at discrete times. In particular, for a Markov chain obtained by evolving our SDE on approximately unit random time intervals and slightly perturbing the nonlinearity within our constraint class at each timestep, we prove that there exists a stationary measure whenever just a single mode is damped.

Degree-boundedness (Xiying Du)

Series
Graph Theory Seminar
Time
Tuesday, September 24, 2024 - 15:30 for 1 hour (actually 50 minutes)
Location
Skiles 005
Speaker
Xiying DuGeorgia Tech

We say a class of graphs $\mathcal{F}$ is degree-bounded if there exists a function $f$ such that $\delta(G)\le f(\tau(G))$ for every $G\in\mathcal{F}$, where $\delta(G)$ denotes the minimum degree of $G$ and $\tau(G)$ is the biclique number of $G$, that is, the largest integer $t$ such that $G$ contains $K_{t,t}$ as a subgraph. Such a function $f$ is called a degree-bounding function for $\mathcal{F}$.

In joint work with Ant\'onio Gir\~ao, Zach Hunter, Rose McCarty and Alex Scott, we proved that every hereditary degree-bounded class $\mathcal{F}$ has a degree-bounding function that is singly exponential in the biclique number $\tau$. A more recent result by Ant\'onio Gir\~ao and Zach Hunter improved this bound to a polynomial function in $\tau$. In this talk, I will discuss examples and the recent results on degree-boundedness. 

Finding Cheeger cuts via 1-Laplacian of graphs

Series
Applied and Computational Mathematics Seminar
Time
Monday, September 23, 2024 - 14:00 for 1 hour (actually 50 minutes)
Location
Skiles 005
Speaker
Wei ZhuUniversity of Alabama at Tuscaloosa

Finding Cheeger cuts of graphs is an NP-hard problem, and one often resorts to approximate solutions. In the literature, spectral graph theory provides the most popular approaches for obtaining such approximate solutions. Recently, K.C. Chang introduced a novel nonlinear spectral graph theory and proved that the seek of Cheeger cuts is equivalent to solving a constrained optimization problem. However, this resulting optimization problem is also very challenging as it involves a non-differentiable function over a non-convex set that is composed of simplex cells of different dimensions. In this talk, we will discuss an ADMM algorithm for solving this optimization problem and provide some convergence analysis. Experimental results will be presented for typical graphs, including Petersen's graph and Cockroach graphs, the well-known Zachary karate club graph, and some preliminary applications in material sciences.

Pages