Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Authors: Chenyang Zhang, Xuran Meng, Yuan Cao

ICLR 2025

Reproducibility assessment (variable, result, and LLM response):
Research Type: Experimental. "We conduct numerical experiments, empirically show that training loss will converge, and verify our conclusions regarding the optimization trajectories of trainable parameters. Specifically, the sparsity of the attention score matrix empirically demonstrates that one-layer transformers can effectively learn the optimal variable selection. Additionally, we transfer the pre-trained one-layer transformers to downstream tasks, and empirically show that it can achieve a good generalization performance with a small sample size. All these empirical observations back up our theoretical findings."
Researcher Affiliation: Academia. Chenyang Zhang, Xuran Meng, Yuan Cao; The University of Hong Kong; University of Michigan, Ann Arbor; EMAIL, EMAIL, EMAIL.
Pseudocode: No. The paper describes mathematical formulations and theoretical proofs but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets: Yes. "In this section, we conduct experiments using the CIFAR-10 dataset, where each image has a shape of 3x32x32, representing three color channels (RGB)."
Dataset Splits: Yes. "For this experiment, we select two labels, Frog and Airplane, and use 500 images from each label."
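The two-label subset described in this row can be sketched as a balanced index selection. This is a minimal sketch, not the authors' code: the labels array below is a synthetic stand-in for the real CIFAR-10 labels, and `balanced_subset` is a hypothetical helper. (In the standard torchvision label ordering, airplane is class 0 and frog is class 6.)

```python
import numpy as np

def balanced_subset(labels, class_a, class_b, per_class, seed=None):
    """Return indices of a balanced two-class subset, `per_class` samples each."""
    rng = np.random.default_rng(seed)
    idx_a = rng.choice(np.flatnonzero(labels == class_a), per_class, replace=False)
    idx_b = rng.choice(np.flatnonzero(labels == class_b), per_class, replace=False)
    return np.concatenate([idx_a, idx_b])

# Synthetic stand-in for CIFAR-10 test labels: 10 classes x 1000 samples.
labels = np.repeat(np.arange(10), 1000)
# Frog (6) vs. Airplane (0), 500 images per label as reported in the paper.
subset = balanced_subset(labels, class_a=6, class_b=0, per_class=500, seed=0)
```

The resulting `subset` holds 1000 indices, 500 per class, which can then be used to slice the image tensor for the binary classification experiment.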
Hardware Specification: No. The paper does not provide specific details about the hardware used (e.g., GPU models, CPU types) for running the experiments.
Software Dependencies: No. The paper does not specify any software dependencies with version numbers used for the experiments.
Experiment Setup: Yes. "We set the learning rate η = 0.5 and train the models for 400 iterations. ... Both experiments use a sample size of 400, and the learning rate is set to 10^-3. ... The transformer model is initialized to 0, and we train it using a batch size of 64 and a learning rate of 10^-3."
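The first reported setting (zero initialization, full-batch gradient descent with η = 0.5 for 400 iterations, sample size 400) can be sketched as a minimal training loop. This is an illustrative assumption, not the paper's code: a logistic model on synthetic data stands in for the one-layer transformer and its task.

```python
import numpy as np

# Hyperparameters as reported: zero initialization, lr = 0.5, 400 iterations.
eta, iters = 0.5, 400
rng = np.random.default_rng(0)

# Synthetic stand-in data: 400 samples (the reported sample size), 20 features.
X = rng.standard_normal((400, 20))
y = np.sign(X @ rng.standard_normal(20))  # binary labels in {-1, +1}

w = np.zeros(20)  # zero initialization, as in the paper's setup

losses = []
for _ in range(iters):
    margins = -y * (X @ w)
    sig = 0.5 * (1.0 + np.tanh(0.5 * margins))  # numerically stable sigmoid(margins)
    w -= eta * (X.T @ (-y * sig)) / len(y)      # full-batch gradient step
    # Logistic loss, mean of log(1 + exp(-y * <x, w>)), via stable logaddexp.
    losses.append(np.mean(np.logaddexp(0.0, -y * (X @ w))))
```

On this toy problem the loss curve decreases toward zero, mirroring the convergence behavior the paper reports for its one-layer transformer.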