Transformer Learns Optimal Variable Selection in Group-Sparse Classification
Authors: Chenyang Zhang, Xuran Meng, Yuan Cao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct numerical experiments, empirically show that the training loss converges, and verify our conclusions regarding the optimization trajectories of trainable parameters. Specifically, the sparsity of the attention score matrix empirically demonstrates that one-layer transformers can effectively learn the optimal variable selection. Additionally, we transfer the pre-trained one-layer transformers to downstream tasks and empirically show that they achieve good generalization performance with a small sample size. All these empirical observations back up our theoretical findings. |
| Researcher Affiliation | Academia | Chenyang Zhang, Xuran Meng, Yuan Cao. Affiliations: The University of Hong Kong; University of Michigan, Ann Arbor. Emails: EMAIL, EMAIL, EMAIL. |
| Pseudocode | No | The paper describes mathematical formulations and theoretical proofs but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | In this section, we conduct experiments using the CIFAR-10 dataset, where each image has a shape of 3x32x32, representing three color channels (RGB). |
| Dataset Splits | Yes | For this experiment, we select two labels, Frog and Airplane, and use 500 images from each label. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used (e.g., GPU models, CPU types) for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers used for the experiments. |
| Experiment Setup | Yes | We set the learning rate η = 0.5 and train the models for 400 iterations. ... Both experiments use a sample size of 400, and the learning rate is set to 10^-3. ... The transformer model is initialized to 0, and we train it using a batch size of 64 and a learning rate of 10^-3. |
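The quoted setup (learning rate η = 0.5, 400 iterations, zero initialization of the transformer parameters) can be sketched on synthetic data. The toy model and data below are invented stand-ins, not the authors' exact architecture: a single softmax-attention layer pools patches, only one patch carries the label signal (a stand-in for group sparsity), and gradients are computed by finite differences for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters quoted in the table; the rest is an assumed toy setup.
eta, iters, n = 0.5, 400, 400
num_patches, dim = 4, 8

X = rng.normal(size=(n, num_patches, dim))
w_true = rng.normal(size=dim)
y = np.sign(X[:, 0, :] @ w_true)          # only patch 0 is label-relevant

def loss(params):
    """Logistic loss of a one-layer attention classifier."""
    q, v = params[:dim], params[dim:]
    scores = X @ q                                     # (n, num_patches)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # softmax over patches
    pooled = np.einsum("np,npd->nd", attn, X)          # attention-weighted pooling
    logits = pooled @ v
    return np.mean(np.logaddexp(0.0, -y * logits))

def num_grad(f, p, eps=1e-5):
    """Central finite-difference gradient (fine for this small toy)."""
    g = np.zeros_like(p)
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

params = np.zeros(2 * dim)                # zero initialization, as in the paper
init_loss = loss(params)                  # log(2) at the zero classifier
for _ in range(iters):
    params -= eta * num_grad(loss, params)
final_loss = loss(params)
```

Under this setup the training loss decreases from log 2 over the 400 gradient-descent steps, and the learned attention query `q` shifts weight toward the informative patch, loosely mirroring the variable-selection behavior the paper reports.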