Benign Overfitting in Token Selection of Attention Mechanism

Authors: Keitaro Sakamoto, Issei Sato

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets."
Researcher Affiliation | Academia | "Department of Computer Science, The University of Tokyo, Tokyo, Japan. Correspondence to: Keitaro Sakamoto <EMAIL>, Issei Sato <EMAIL>."
Pseudocode | No | The paper describes methods textually and mathematically but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "The code is available on GitHub: https://github.com/keitaroskmt/benign-attention"
Open Datasets | Yes | "We further conducted real-world experiments on image and natural language datasets for classification. For each task, we used the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models. ... 10-class image classification with MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009), anomaly detection in medical images with PneumoniaMNIST and BreastMNIST (Yang et al., 2023), topic classification of text with AG News (Zhang et al., 2015), and question type classification with TREC (Li & Roth, 2002)."
Dataset Splits | Yes | "Table 2 presents the training loss and test accuracy when varying the training size n. ... Training Size n: 20, 200, 1000 ... we used the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models. ... We used datasets from various types: 10-class image classification with MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009), anomaly detection in medical images with PneumoniaMNIST and BreastMNIST (Yang et al., 2023), topic classification of text with AG News (Zhang et al., 2015), and question type classification with TREC (Li & Roth, 2002). For detailed descriptions of these datasets, please refer to Appendix F.2."
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models) used for running the experiments.
Software Dependencies | No | "We prepared the pre-trained ViT (Dosovitskiy et al., 2021) and BERT (Devlin et al., 2018) models using the Hugging Face Transformers library (Wolf et al., 2020). ... During the experiments, the AdamW optimizer (Loshchilov & Hutter, 2019) without weight decay was used..." While software components like the Hugging Face Transformers library and the AdamW optimizer are mentioned, specific version numbers for these libraries or other key software dependencies are not provided, only citations to the papers introducing them.
Experiment Setup | Yes | "Specifically, we consider the setting with n = 20, T = 8, η = 0.2, ρ = 0.1 and α = 5e-3, changing the value of the dimension d and the signal size µ². ... During the experiments, the AdamW optimizer (Loshchilov & Hutter, 2019) without weight decay was used with a learning rate of 5e-5, along with linear warmup and learning rate decay."
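For reference, the learning-rate schedule quoted above (linear warmup followed by linear decay, peaking at 5e-5) can be sketched as a plain function. This is a minimal illustration, not the paper's code: the total step count and warmup length are hypothetical, since the paper does not state them.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=5e-5):
    """Learning rate at a given step: linear warmup from 0 to base_lr
    over warmup_steps, then linear decay back to 0 at total_steps.

    base_lr matches the 5e-5 peak reported in the paper; total_steps
    and warmup_steps are assumed values for illustration only.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the current step.
        return base_lr * step / warmup_steps
    # Decay phase: ramp down linearly, clamped at zero past total_steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In practice the same shape is typically obtained by pairing an AdamW optimizer (with weight decay disabled, as the paper specifies) with a lambda-based scheduler in a deep learning framework; the function above only makes the schedule's arithmetic explicit.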