ICLR: In-Context Learning of Representations
Authors: Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we take inspiration from the theory of conceptual role semantics and define a toy graph tracing task wherein the nodes of the graph are referenced via concepts seen during training (e.g., apple, bird, etc.) and the connectivity of the graph is defined via some predefined structure (e.g., a square grid). Given exemplars that indicate traces of random walks on the graph, we analyze intermediate representations of the model and find that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Further, we find that when reference concepts have correlations in their semantics (e.g., Monday, Tuesday, etc.), the context-specified graph structure is still present in the representations, but is unable to dominate the pretrained structure. [...] Overall, our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities. |
| Researcher Affiliation | Collaboration | ¹CBS-NTT Program in Physics of Intelligence, Harvard University; ²Department of Physics, Harvard University; ³Physics & Informatics Lab, NTT Research Inc.; ⁴SEAS, Harvard University; ⁵CSE, University of Michigan, Ann Arbor |
| Pseudocode | No | The paper describes methods and processes in prose and mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 4 details the Dirichlet Energy calculation and its use, but without pseudocode formatting. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code, nor does it include a link to a code repository in the main text or appendices. |
| Open Datasets | No | We introduce a toy graph navigation task that requires a model to interpret semantically meaningful concepts as referents for nodes in a structurally constrained graph. Inputting traces of random walks on this graph into an LLM, we analyze whether the model alters its intermediate representations for referent concepts to predict valid next nodes as defined by the underlying graph connectivity... To construct the square grid, we randomly arrange the set of tokens in a grid and add edges between horizontal and vertical neighbors. We then perform a random walk on the graph, emitting the visited tokens as a sequence (Fig. 1 (b)). |
| Dataset Splits | No | The paper describes generating synthetic data via random walks on predefined graphs for in-context learning. It treats context length as a variable and explains how inputs are generated for the LLM, but it does not specify explicit training, validation, or test splits in the traditional sense, since it evaluates pre-trained LLMs rather than training new models. |
| Hardware Specification | Yes | We run our experiments on either A100 nodes, or by using the APIs provided by NDIF (Fiotto-Kaufman et al., 2024). [...] Part of the computations in this paper were run on the FASRC cluster supported by the FAS Division of Science Research Computing Group at Harvard University. |
| Software Dependencies | Yes | In the main paper, we primarily focus on Llama3.1-8B (henceforth Llama3) (Dubey et al., 2024), accessed via NDIF/NNsight (Fiotto-Kaufman et al., 2024). We present results on other models Llama3.2-1B / Llama3.1-8B-Instruct (Dubey et al., 2024) and Gemma-2-2B / Gemma-2-9B (Gemma Team, 2024) in App. C.2. |
| Experiment Setup | Yes | At each timestep, we look at a window of Nw (=50) preceding tokens (or all tokens if the context length is smaller than Nw), and collect all activations corresponding to each token τ ∈ T at a given layer ℓ. [...] In the case when our context length (Nc) is longer than the window, we simply use every token (Nw = Nc). |
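The task construction quoted under "Open Datasets" (randomly arranging tokens on a square grid, adding edges between horizontal and vertical neighbors, then emitting a random-walk trace) can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' code; the function name and token set are hypothetical.

```python
import random

def square_grid_walk(tokens, side, walk_len, seed=0):
    """Randomly place `tokens` on a side x side grid, connect each cell to its
    horizontal/vertical neighbors, and emit a random-walk trace of visited tokens."""
    assert len(tokens) == side * side
    rng = random.Random(seed)
    placed = list(tokens)
    rng.shuffle(placed)  # random arrangement of tokens on the grid

    def neighbors(i):
        # 4-neighborhood of flat index i on the side x side grid
        r, c = divmod(i, side)
        out = []
        if r > 0:
            out.append(i - side)
        if r < side - 1:
            out.append(i + side)
        if c > 0:
            out.append(i - 1)
        if c < side - 1:
            out.append(i + 1)
        return out

    node = rng.randrange(side * side)
    trace = [placed[node]]
    for _ in range(walk_len - 1):
        node = rng.choice(neighbors(node))
        trace.append(placed[node])
    return trace

trace = square_grid_walk(["apple", "bird", "car", "dog"], side=2, walk_len=8)
```

The emitted trace plays the role of the in-context exemplar sequence fed to the LLM; consecutive tokens are always grid neighbors, which is the structure the model must infer from context.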
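The "Pseudocode" row notes that the paper describes a Dirichlet energy calculation in prose (Section 4) without an algorithm block. For a graph with node representations, the standard Dirichlet energy sums squared distances between representations of adjacent nodes; the sketch below shows that standard form only, as an assumption, since the paper's exact normalization is not quoted here.

```python
import numpy as np

def dirichlet_energy(reps, edges):
    """Sum of squared Euclidean distances between representations of adjacent
    nodes. `reps` maps node -> vector; `edges` is an iterable of (u, v) pairs.
    NOTE: normalization conventions vary; the paper's exact formula may differ."""
    return sum(float(np.sum((reps[u] - reps[v]) ** 2)) for u, v in edges)

reps = {
    "a": np.array([0.0, 0.0]),
    "b": np.array([1.0, 0.0]),
    "c": np.array([1.0, 1.0]),
}
energy = dirichlet_energy(reps, [("a", "b"), ("b", "c")])  # 1.0 + 1.0 = 2.0
```

A low energy means neighboring nodes have similar representations, i.e., the model's internal geometry aligns with the context-specified graph.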
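The windowing procedure quoted under "Experiment Setup" (collect activations for a token τ over the preceding Nw = 50 tokens, or all tokens when the context is shorter) can be expressed as a short sketch. The data layout (parallel lists of token IDs and per-token activations) is a hypothetical simplification of however the authors actually store hidden states.

```python
def window_activations(token_ids, acts, target_token, t, n_w=50):
    """At timestep t, gather the activations of every occurrence of
    `target_token` among the preceding min(t, n_w) positions."""
    start = max(0, t - n_w)  # use the whole context if it is shorter than n_w
    return [acts[i] for i in range(start, t) if token_ids[i] == target_token]

token_ids = ["apple", "bird", "apple", "car", "apple"]
acts = [0, 1, 2, 3, 4]  # stand-ins for hidden-state vectors at a fixed layer
collected = window_activations(token_ids, acts, "apple", t=5)  # [0, 2, 4]
```

Averaging the collected activations per token then gives the per-concept representation whose geometry is analyzed as context length grows.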