Provable In-Context Vector Arithmetic via Retrieving Task Concepts
Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical simulations corroborate our theoretical insights. |
| Researcher Affiliation | Academia | 1Department of Computer Science, City University of Hong Kong, Hong Kong SAR 2Center for Advanced Intelligence Project, RIKEN, Japan 3School of Mathematics and Statistics, The University of Sydney, Australia 4Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 5Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore 6College of Computing and Data Science, Nanyang Technological University, Singapore 7Department of Mathematical Informatics, The University of Tokyo, Japan. Correspondence to: Wei Huang <EMAIL>, Hau-San Wong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Training algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about open-sourcing their code or a repository link. |
| Open Datasets | No | In this section, we present our data modeling based on the observations of task vector arithmetic in factual-recall ICL illustrated in Figure 1. We found the near-orthogonal properties in Figure 1 coincide with Park et al. (2025), which suggests that LLMs encode high- and low-level concepts in an approximately orthogonal manner. Specifically, we treat the task vector as a high-level concept representation, while orthogonal components represent task-specific low-level concepts. Details are deferred to Appendix C. |
| Dataset Splits | No | The paper defines 'Training Setups' and 'Test Setup' where data is generated from specific distributions (PQA, PT, PTQA) with noise, rather than splitting a fixed, external dataset into explicit training, validation, and test portions with specified percentages or counts. |
| Hardware Specification | No | The paper describes 'Empirical simulations' in Section 5 but does not mention any specific hardware (e.g., GPU/CPU models, memory) used for these simulations. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for the experiments. |
| Experiment Setup | Yes | For comparison, we use the same parameters for both ICL-trained and QA-trained models and plot the 1-sigma error dynamics, as illustrated in Figures 2 and 3: K = 2, K′ = 100, d = 3000, n = 200, M = 30, L = L′ = 30, η = 5, q_V = 10⁻⁵, σ₀ = 10⁻³, σ₁ = 5×10⁻³, σ_p = σ′_p = 10⁻². The QA-trained model is trained for T = 2000 epochs, while the ICL-trained model undergoes a longer training process with T = 5000 epochs. |
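The setup quoted above can be collected into a single configuration, which also lets one check the near-orthogonality property the paper builds on (Figure 1; Park et al., 2025). The sketch below is illustrative only: the variable names, the prime/subscript placement (K′, L′, q_V, σ₀, …), and the interpretation of each hyperparameter are assumptions reconstructed from the extracted text, not the authors' code.

```python
import numpy as np

# Hedged sketch of the quoted experiment setup. All names and the
# meaning attached to each value are assumptions; the paper releases
# no code, so this is not the authors' implementation.
config = {
    "K": 2,            # assumed: number of high-level task concepts
    "K_prime": 100,    # assumed: number of low-level concepts
    "d": 3000,         # embedding dimension
    "n": 200,          # assumed: number of in-context samples
    "M": 30,
    "L": 30,
    "L_prime": 30,
    "eta": 5,          # learning rate
    "q_V": 1e-5,
    "sigma_0": 1e-3,
    "sigma_1": 5e-3,
    "sigma_p": 1e-2,
    "T_QA": 2000,      # epochs, QA-trained model
    "T_ICL": 5000,     # epochs, ICL-trained model (longer training)
}

# Near-orthogonality illustration: in d = 3000 dimensions, two random
# unit vectors have cosine similarity close to zero, so high- and
# low-level concept directions barely interfere with one another.
rng = np.random.default_rng(0)
u = rng.standard_normal(config["d"])
v = rng.standard_normal(config["d"])
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
cosine = float(u @ v)
print(f"|cos(u, v)| = {abs(cosine):.4f}")  # small for large d
```

The cosine concentrates around 0 with standard deviation roughly 1/√d ≈ 0.018 here, which is the sense in which randomly placed concept directions are "approximately orthogonal".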