Benign Overfitting in Token Selection of Attention Mechanism
Authors: Keitaro Sakamoto, Issei Sato
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets. |
| Researcher Affiliation | Academia | 1Department of Computer Science, The University of Tokyo, Tokyo, Japan. Correspondence to: Keitaro Sakamoto <EMAIL>, Issei Sato <EMAIL>. |
| Pseudocode | No | The paper describes methods textually and mathematically but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available on GitHub: https://github.com/keitaroskmt/benign-attention |
| Open Datasets | Yes | We further conducted real-world experiments on image and natural language datasets for classification. For each task, we used the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models. ... 10-class image classification with MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009), anomaly detection in medical image with PneumoniaMNIST and BreastMNIST (Yang et al., 2023), topic classification of text with AG-news (Zhang et al., 2015), and question type classification with TREC (Li & Roth, 2002). |
| Dataset Splits | Yes | Table 2 presents the training loss and test accuracy when varying the training size n. ... Training Size n 20 200 1000 ... we used the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models. ... We used datasets from various types: 10-class image classification with MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009), anomaly detection in medical image with PneumoniaMNIST and BreastMNIST (Yang et al., 2023), topic classification of text with AG-news (Zhang et al., 2015), and question type classification with TREC (Li & Roth, 2002). For detailed descriptions of these datasets, please refer to Appendix F.2. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models) used for running the experiments. |
| Software Dependencies | No | We prepared the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models using the Hugging Face Transformers library (Wolf et al., 2020). ... During the experiments, the AdamW optimizer (Loshchilov & Hutter, 2019) without weight decay was used... While software components like the Hugging Face Transformers library and the AdamW optimizer are mentioned, specific version numbers for these libraries or other key software dependencies are not provided, only citations to the papers introducing them. |
| Experiment Setup | Yes | Specifically, we consider the setting with n = 20, T = 8, η = 0.2, ρ = 0.1 and α = 5e-3, changing the value of the dimension d and the signal size µ². ... During the experiments, the AdamW optimizer (Loshchilov & Hutter, 2019) without weight decay was used with a learning rate of 5e-5, along with linear warmup and learning rate decay. |
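The fine-tuning schedule quoted above (linear warmup to a peak learning rate of 5e-5, then linear decay) can be sketched as a plain function. This is a minimal illustration, not the authors' code: the function name, the warmup fraction, and the decay-to-zero endpoint are assumptions, since the report notes the paper does not specify them.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, base_lr=5e-5):
    """Learning rate at a given optimizer step, assuming linear warmup
    from ~0 to base_lr over warmup_steps, then linear decay to 0 by
    total_steps. Warmup fraction and zero endpoint are illustrative
    assumptions; only base_lr = 5e-5 is stated in the paper's quote."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay: ramp linearly down to 0 at total_steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

In practice this shape matches what schedulers such as Hugging Face's `get_linear_schedule_with_warmup` produce when paired with `AdamW(..., weight_decay=0.0)`, consistent with the "without weight decay" detail quoted in the table.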