Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Authors: Hongru Yang, Zhangyang Wang, Jason D. Lee, Yingbin Liang

ICLR 2025

Reproducibility evaluation: each item below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental. LLM response: "As guidance for our theory, we first conduct experiments to observe the training dynamics of the transformer, where we train all the weights simultaneously. Our experimental results show a clear stage-wise learning phenomenon in which the neuron weights are learned before the attention modules. We empirically show the difficulty of generalizing our analysis of the gradient flow dynamics even to the case where the number of mixtures equals three, although the transformer can still successfully learn such a distribution."
Researcher Affiliation: Academia. LLM response: Hongru Yang (The University of Texas at Austin & Princeton University, EMAIL); Zhangyang Wang (The University of Texas at Austin, EMAIL); Jason D. Lee (Princeton University, EMAIL); Yingbin Liang (The Ohio State University, EMAIL).
Pseudocode: Yes. LLM response: Algorithm 1, "Three-stage Training".
Open Source Code: No. LLM response: The paper does not provide any explicit statement about releasing source code for its methodology, nor does it include a link to a code repository.
Open Datasets: Yes. LLM response: "In our experiments, we use the MNIST dataset and extract the images with labels 1 and 2 to play the role of classification signals."
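The label-extraction step quoted above (keeping only MNIST images with labels 1 and 2) can be sketched as a simple mask over a labeled array. This is an illustrative sketch, not the authors' code; the stand-in random arrays below merely mimic MNIST's shapes so the filtering logic is runnable without downloading the dataset.

```python
import numpy as np

def extract_classes(images, labels, keep=(1, 2)):
    """Keep only the samples whose label is in `keep` (labels 1 and 2 in the paper)."""
    mask = np.isin(labels, keep)
    return images[mask], labels[mask]

# Stand-in data with MNIST-like shapes (28x28 images, labels 0-9).
rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))
labels = rng.integers(0, 10, size=100)

X, y = extract_classes(images, labels)
assert set(np.unique(y)) <= {1, 2}
```

On real MNIST the same `extract_classes` call applies once the images and labels are loaded as numpy arrays.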
Dataset Splits: No. LLM response: The paper describes how the synthetic K-mixture of linear classification data is generated (Definition 2.1) and mentions specific parameters, such as K = 2 and L = 3, for the training dynamics. For the MNIST dataset, it states that images with labels 1 and 2 are extracted. However, it does not provide explicit training, validation, or test split percentages or counts for either the synthetic or the real-world dataset, nor does it reference standard splits with citations.
Hardware Specification: No. LLM response: The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies: No. LLM response: The paper does not specify any software dependencies with version numbers that would be necessary to replicate the experiments.
Experiment Setup: Yes. LLM response: "In our experiments, all the weights in the transformer are trained simultaneously via gradient descent with learning rate 0.1. Initialize w^(0) = 0, W_K^(0), W_Q^(0) ~ N(0, ω²/m), and b to be a sufficiently small positive constant such as 1/2. The attention initialization scale satisfies ω < C < 1 for some small constant C."
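The initialization quoted above can be sketched in numpy. This is a minimal sketch under stated assumptions: the dimensions `d` and `m` and the value of `omega` are hypothetical placeholders (the paper only requires ω < C < 1), and N(0, ω²/m) is read as i.i.d. Gaussian entries with variance ω²/m, i.e. standard deviation ω/√m.

```python
import numpy as np

# Hypothetical dimensions for illustration (not taken from the paper).
d, m = 10, 64
omega = 0.1   # attention initialization scale; the paper requires omega < C < 1

rng = np.random.default_rng(0)

w = np.zeros(d)                                        # w^(0) = 0
W_K = rng.normal(0.0, omega / np.sqrt(m), size=(m, d)) # entries ~ N(0, omega^2/m)
W_Q = rng.normal(0.0, omega / np.sqrt(m), size=(m, d))
b = 0.5                                                # small positive constant, as in the paper

lr = 0.1  # gradient descent learning rate reported in the paper
```

All weights would then be updated simultaneously by gradient descent with this learning rate, per the quoted setup.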