Nonlinear transformers can perform inference-time feature learning

Authors: Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, Taiji Suzuki

ICML 2025

Reproducibility Assessment (Variable | Result | LLM Response)
Research Type | Experimental | "We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information of the target features β solely from test prompts... We conduct numerical experiments on synthetic data to compare the in-context learning algorithm implemented by nonlinear transformers against non-adaptive kernel methods."
Researcher Affiliation | Academia | "1 The University of Tokyo, Tokyo, Japan; 2 RIKEN AIP, Tokyo, Japan; 3 University of California, Berkeley; 4 New York University; 5 Flatiron Institute. Correspondence to: Naoki Nishikawa <EMAIL>, Yujin Song <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Gradient-based training of transformer. Input: learning rates η1, η2, regularization rates λ1, λ2, initialization scale α, temperature ρ"
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using a "6-layer GPT-2 model", which is a third-party tool.
Open Datasets | No | "We conduct numerical experiments on synthetic data to compare the in-context learning algorithm... For each test task t, we generate data as x^t_1, ..., x^t_{N_test} ∼ N(0, I_d), β^t ∼ Unif(S^{d−1})..."
Dataset Splits | No | "For each test task t, we generate data as x^t_1, ..., x^t_{N_test} ∼ N(0, I_d), β^t ∼ Unif(S^{d−1}) (i.e., r = d), with y^t_i = σ(⟨β^t, x^t_i⟩) for i ∈ [N]. We compare the performance of two approaches..." The paper describes generating fresh synthetic data for each task rather than using a fixed dataset with explicit train/test/validation splits.
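The quoted data-generating process can be sketched as a short NumPy routine. Note that the dimensions, the random seed, and the use of tanh as the link function σ are illustrative assumptions; the excerpt above does not specify σ.

```python
import numpy as np

def sample_task(rng, d, n):
    """One synthetic test task, per the quoted setup:
    x_i ~ N(0, I_d), beta ~ Unif(S^{d-1}), y_i = sigma(<beta, x_i>)."""
    X = rng.normal(size=(n, d))        # x_1, ..., x_n ~ N(0, I_d)
    beta = rng.normal(size=d)
    beta /= np.linalg.norm(beta)       # normalized Gaussian => uniform on S^{d-1}
    y = np.tanh(X @ beta)              # sigma = tanh is an assumed placeholder link
    return X, y, beta

# Hypothetical sizes for illustration (d = 16, N_test = 32 positions).
rng = np.random.default_rng(0)
X, y, beta = sample_task(rng, d=16, n=32)
```

Because a new (X, y, β) triple is drawn per task, there is no fixed dataset to split, which is consistent with the "No" assessment above.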
Hardware Specification | No | The paper mentions training a "6-layer GPT-2 model" and performing "numerical experiments" but does not specify hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | "We train a 6-layer GPT-2 model (Radford et al., 2019)... We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer..." The paper names software components such as GPT-2 and the Adam optimizer but does not provide version numbers for these or other relevant libraries.
Experiment Setup | Yes | "We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer with learning rate 0.0001 on the mean-squared loss calculated over all the positions."
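The quoted recipe (Adam, learning rate 0.0001, mean-squared loss averaged over all positions) can be sketched in NumPy. The model here is a linear read-out standing in for the authors' 6-layer GPT-2, and the data sizes, seed, and tanh link are illustrative assumptions, so this shows only the optimizer and loss wiring, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 8, 16, 1e-4                 # lr = 0.0001 as quoted; d, n are illustrative

# One synthetic prompt: n positions, target y from a random unit-norm beta.
X = rng.normal(size=(n, d))
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
y = np.tanh(X @ beta)                  # assumed placeholder link

# Stand-in model: linear predictor applied at every position.
W = rng.normal(scale=0.1, size=d)
m = np.zeros(d); v = np.zeros(d)       # Adam first/second moment estimates
b1, b2, eps = 0.9, 0.999, 1e-8         # Adam defaults (Kingma & Ba, 2015)

loss0 = np.mean((X @ W - y) ** 2)      # MSE averaged over all positions
for t in range(1, 201):
    grad = 2 * X.T @ (X @ W - y) / n   # gradient of the per-position MSE
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)           # bias-corrected moments
    vhat = v / (1 - b2 ** t)
    W -= lr * mhat / (np.sqrt(vhat) + eps)
loss_final = np.mean((X @ W - y) ** 2)
```

With the small quoted learning rate, the loss decreases gradually over the 200 steps; the paper's actual pretraining applies the same optimizer and loss to the GPT-2 model's outputs at every prompt position.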