Position: We Need An Algorithmic Understanding of Generative AI

Authors: Oliver Eberle, Thomas Austin McGee, Hamza Giaffar, Taylor Whittington Webb, Ida Momennejad

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To ground our position in an empirical example, we conducted a case study focused on LLMs, which have been shown to perform poorly on graph navigation and multi-step planning tasks (Momennejad et al., 2023). In cases where they do succeed, it remains unclear how they solve these problems, e.g., whether they implement classic search algorithms or use other strategies. To address this question, we studied the algorithms employed by two widely used LLMs, instruction-tuned Llama-3.1 with 8B and 70B parameters, in the context of graph navigation.
Researcher Affiliation | Collaboration | (1) Technische Universität Berlin, Berlin, Germany; (2) BIFOLD Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; (3) University of California Los Angeles, Los Angeles, USA; (4) Halıcıoğlu Data Science Institute, University of California San Diego, San Diego, USA; (5) Microsoft Research NYC, New York, USA.
Pseudocode | No | The paper discusses algorithmic concepts and methodologies but does not provide any structured pseudocode or algorithm blocks. Figure 1 shows a conceptual flow diagram, not pseudocode.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | No | The paper describes a case study using 'instruction-tuned Llama-3.1 with 8B and 70B parameters' and a 'simple tree graph structure, presented in a prompt'. While Llama models are generally available, the paper does not provide concrete access information (link, citation for dataset) for any specific dataset used in their experiments, beyond describing the task and the prompt itself.
Dataset Splits | No | The paper describes a case study involving the analysis of pre-trained LLMs on a specific graph navigation task defined by a prompt. It does not involve traditional training/validation/test dataset splits, so no such split information is provided.
Hardware Specification | No | The paper mentions using Llama-3.1-8B and 70B models for experiments but does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running these analyses.
Software Dependencies | Yes | Mixed-effects modeling was conducted using the lmerTest package in R.
Experiment Setup | Yes | Prompt. We introduce the model to a two-step tree graph following the prompt from Momennejad et al. (2023), which demonstrated that LLMs struggle with graph navigation and especially tree search. The model is tasked with determining the validity of a given path, producing a single-token output: yes or no. The full prompt and task, starting from the lobby with goal location W, are shown in Figure 3a. We next present results on Llama-3.1-8B, with additional analyses of the 70B model presented in Appendix A.4.
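The task structure described above can be sketched in code: a two-step tree graph, a classic breadth-first search (one candidate algorithm the LLMs may or may not implement internally), and a ground-truth yes/no labeler for candidate paths. This is a minimal illustrative sketch, not the paper's implementation; the node names and adjacency structure are assumptions, since the exact graph is defined by the prompt in the paper's Figure 3a.

```python
from collections import deque

# Hypothetical two-step tree rooted at the lobby: the lobby connects to two
# rooms, and each room connects to two leaf locations (including goal W).
# These labels are illustrative assumptions, not the paper's actual graph.
TREE = {
    "lobby": ["room1", "room2"],
    "room1": ["X", "W"],
    "room2": ["Y", "Z"],
}

def bfs_path(graph, start, goal):
    """Classic breadth-first search over the tree: returns the first
    path from start to goal, or None if the goal is unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def label(graph, path):
    """Ground-truth single-token label for the path-validity task:
    'yes' if every consecutive step follows an edge, else 'no'."""
    valid = all(b in graph.get(a, []) for a, b in zip(path, path[1:]))
    return "yes" if valid else "no"
```

A model's single-token answer to a candidate path such as `["lobby", "room2", "W"]` could then be compared against `label(TREE, ...)`, while `bfs_path` serves as a reference for what a classic search strategy would produce.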