Position: Principles of Animal Cognition to Improve LLM Evaluations
Authors: Sunayana Rane, Cyrus F. Kirkman, Graham Todd, Amanda Royka, Ryan M.C. Law, Erica Cartmill, Jacob Gates Foster
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We ground these principles in an empirical case study, and show how they can already provide a richer picture of one particular reasoning capability: transitive inference. ... 6.1. Experiment 1: Transitive Operator & Element Manipulation ... 6.2. Experiment 2: Trial Structure (n-term task) ... 7. Empirical Results |
| Researcher Affiliation | Academia | 1Department of Computer Science, Princeton University 2Department of Psychology, University of California Los Angeles 3Department of Computer Science and Engineering, New York University Tandon 4Department of Psychology, Yale University 5MRC Cognition and Brain Sciences Unit, University of Cambridge 6Department of Anthropology, Cognitive Science Program, and Program in Animal Behavior, Indiana University Bloomington 7Department of Informatics and Cognitive Science Program, Indiana University Bloomington 8Santa Fe Institute. |
| Pseudocode | No | The paper describes experimental procedures and results, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to repositories. |
| Open Datasets | No | The paper uses custom-designed stimulus sets for its experiments (e.g., 'ranked words (transitively-linked animal names)', 'random strings', 'seven word stimuli'). It does not refer to or provide access information for a pre-existing publicly available dataset. |
| Dataset Splits | Yes | After the language model guessed one of two options, it was differentially reinforced with a response of in/correct. Sequential trials were presented in a quasirandom order, in which there could be no more than three consecutive repeats of one trial type. Correct word order within trials was alternated randomly. Piloting showed that the model was able to learn these pairwise discriminations within 3-5 trials, so we presented 10 of each pair for a total of 60 training trials per iteration. After training was complete, we tested for TI by presenting novel non-adjacent pairs. |
| Hardware Specification | No | The paper mentions evaluating GPT-4o but does not specify the hardware used by the authors to conduct their experiments or interact with the model. |
| Software Dependencies | No | The paper mentions using 'GPT-4o' but does not specify any other software libraries or their version numbers that were used for the experimental setup or analysis. |
| Experiment Setup | Yes | Varying the > and bigger than operators serves as a simple adversarial control (P1, P2); if general transitive inference were being used, performance should be insensitive to this variation. We then analyze the specific pattern of failures (P3) as a function of variation in stimulus (P2). Three stimuli sets were used: ranked words (transitively-linked animal names ranked from biggest to smallest size), reverse rank (incorrectly ranked animal names in reverse order of biggest to smallest), and random strings (no transitive link between words). ... We turn to a robust trial-structured task frequently used in animal cognition studies of TI called the n-term task. This trial-based structure is inherently less linguistic as it is operator-agnostic. Our n-term task is designed to note systemic limitations (P5) that may arise from abstracting the task away from the linguistic domain. That being said, we began by giving the LLM some useful linguistic information (more than an animal might receive) by first prompting the model with information regarding ideal performance parameters (see Figure 1 for full prompt). The language model was then presented with a series of consecutive choice trials, each consisting of two words systematically chosen for transitive neutrality. Seven word stimuli were chosen, and were randomly paired across 10 iterations of this task. Within one iteration, pairs remained consistent and were bound in an ascending order (AB, BC...FG, such that A was always correct and B was always incorrect). After the language model guessed one of two options, it was differentially reinforced with a response of in/correct. Sequential trials were presented in a quasirandom order, in which there could be no more than three consecutive repeats of one trial type. Correct word order within trials was alternated randomly. Piloting showed that the model was able to learn these pairwise discriminations within 3-5 trials, so we presented 10 of each pair for a total of 60 training trials per iteration. |