Nonlinear transformers can perform inference-time feature learning

Authors: Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, Taiji Suzuki

ICML 2025

Reproducibility Assessment (Variable | Result | LLM Response)
Research Type | Experimental | "We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information of the target features β solely from test prompts... We conduct numerical experiments on synthetic data to compare the in-context learning algorithm implemented by nonlinear transformers against non-adaptive kernel methods."
Researcher Affiliation | Academia | "1 The University of Tokyo, Tokyo, Japan; 2 RIKEN AIP, Tokyo, Japan; 3 University of California, Berkeley; 4 New York University; 5 Flatiron Institute. Correspondence to: Naoki Nishikawa <EMAIL>, Yujin Song <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Gradient-based training of transformer. Input: learning rates η1, η2, regularization rates λ1, λ2, initialization scale α, temperature ρ"
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using a "6-layer GPT-2 model", which is a third-party tool.
Open Datasets | No | "We conduct numerical experiments on synthetic data to compare the in-context learning algorithm... For each test task t, we generate data as x^t_1, ..., x^t_{N_test} ∼ N(0, I_d), β^t ∼ Unif(S^{d−1})..."
Dataset Splits | No | "For each test task t, we generate data as x^t_1, ..., x^t_{N_test} ∼ N(0, I_d), β^t ∼ Unif(S^{d−1}) (i.e., r = d), with y^t_i = σ(⟨β^t, x^t_i⟩) for i ∈ [N]. We compare the performance of two approaches..." The paper describes generating fresh synthetic data for each task rather than using a fixed dataset with explicit train/test/validation splits.
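The quoted data-generating process can be sketched as a short NumPy routine. Note that the dimensions, the random seed, and the use of tanh as the link function σ are illustrative assumptions; the excerpt above does not specify σ.

```python
import numpy as np

def sample_task(rng, d, n):
    """One synthetic test task, per the quoted setup:
    x_i ~ N(0, I_d), beta ~ Unif(S^{d-1}), y_i = sigma(<beta, x_i>)."""
    X = rng.normal(size=(n, d))        # x_1, ..., x_n ~ N(0, I_d)
    beta = rng.normal(size=d)
    beta /= np.linalg.norm(beta)       # normalized Gaussian => uniform on S^{d-1}
    y = np.tanh(X @ beta)              # sigma = tanh is an assumed placeholder link
    return X, y, beta

# Hypothetical sizes for illustration (d = 16, N_test = 32 positions).
rng = np.random.default_rng(0)
X, y, beta = sample_task(rng, d=16, n=32)
```

Because a new (X, y, β) triple is drawn per task, there is no fixed dataset to split, which is consistent with the "No" assessment above.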
Hardware Specification | No | The paper mentions training a "6-layer GPT-2 model" and performing "numerical experiments" but does not specify hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | "We train a 6-layer GPT-2 model (Radford et al., 2019)... We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer..." The paper names software components such as GPT-2 and the Adam optimizer but does not provide version numbers for these or other relevant libraries.
Experiment Setup | Yes | "We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer with learning rate 0.0001 on the mean-squared loss calculated over all the positions."
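The quoted recipe (Adam, learning rate 0.0001, mean-squared loss averaged over all positions) can be sketched in NumPy. The model here is a linear read-out standing in for the authors' 6-layer GPT-2, and the data sizes, seed, and tanh link are illustrative assumptions, so this shows only the optimizer and loss wiring, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 8, 16, 1e-4                 # lr = 0.0001 as quoted; d, n are illustrative

# One synthetic prompt: n positions, target y from a random unit-norm beta.
X = rng.normal(size=(n, d))
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
y = np.tanh(X @ beta)                  # assumed placeholder link

# Stand-in model: linear predictor applied at every position.
W = rng.normal(scale=0.1, size=d)
m = np.zeros(d); v = np.zeros(d)       # Adam first/second moment estimates
b1, b2, eps = 0.9, 0.999, 1e-8         # Adam defaults (Kingma & Ba, 2015)

loss0 = np.mean((X @ W - y) ** 2)      # MSE averaged over all positions
for t in range(1, 201):
    grad = 2 * X.T @ (X @ W - y) / n   # gradient of the per-position MSE
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)           # bias-corrected moments
    vhat = v / (1 - b2 ** t)
    W -= lr * mhat / (np.sqrt(vhat) + eps)
loss_final = np.mean((X @ W - y) ** 2)
```

With the small quoted learning rate, the loss decreases gradually over the 200 steps; the paper's actual pretraining applies the same optimizer and loss to the GPT-2 model's outputs at every prompt position.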