Position: Principles of Animal Cognition to Improve LLM Evaluations
Authors: Sunayana Rane, Cyrus F. Kirkman, Graham Todd, Amanda Royka, Ryan M.C. Law, Erica Cartmill, Jacob Gates Foster
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We ground these principles in an empirical case study, and show how they can already provide a richer picture of one particular reasoning capability: transitive inference. ... 6.1. Experiment 1: Transitive Operator & Element Manipulation ... 6.2. Experiment 2: Trial Structure (n-term task) ... 7. Empirical Results |
| Researcher Affiliation | Academia | 1Department of Computer Science, Princeton University 2Department of Psychology, University of California Los Angeles 3Department of Computer Science and Engineering, New York University Tandon 4Department of Psychology, Yale University 5MRC Cognition and Brain Sciences Unit, University of Cambridge 6Department of Anthropology, Cognitive Science Program, and Program in Animal Behavior, Indiana University Bloomington 7Department of Informatics and Cognitive Science Program, Indiana University Bloomington 8Santa Fe Institute. |
| Pseudocode | No | The paper describes experimental procedures and results, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to repositories. |
| Open Datasets | No | The paper uses custom-designed stimulus sets for its experiments (e.g., 'ranked words (transitively-linked animal names)', 'random strings', 'seven word stimuli'). It does not refer to or provide access information for a pre-existing publicly available dataset. |
| Dataset Splits | Yes | After the language model guessed one of two options, it was differentially reinforced with a response of in/correct. Sequential trials were presented in a quasirandom order, in which there could be no more than three consecutive repeats of one trial type. Correct word order within trials was alternated randomly. Piloting showed that the model was able to learn these pairwise discriminations within 3-5 trials, so we presented 10 of each pair for a total of 60 training trials per iteration. After training was complete, we tested for TI by presenting novel non-adjacent pairs. |
| Hardware Specification | No | The paper mentions evaluating GPT-4o but does not specify the hardware used by the authors to conduct their experiments or interact with the model. |
| Software Dependencies | No | The paper mentions using 'GPT-4o' but does not specify any other software libraries or their version numbers that were used for the experimental setup or analysis. |
| Experiment Setup | Yes | Varying the > and bigger than operators serves as a simple adversarial control (P1, P2); if general transitive inference were being used, performance should be insensitive to this variation. We then analyze the specific pattern of failures (P3) as a function of variation in stimulus (P2). Three stimuli sets were used: ranked words (transitively-linked animal names ranked from biggest to smallest size), reverse rank (incorrectly ranked animal names in reverse order of biggest to smallest), and random strings (no transitive link between words). ... We turn to a robust trial-structured task frequently used in animal cognition studies of TI called the n-term task. This trial-based structure is inherently less linguistic as it is operator-agnostic. Our n-term task is designed to note systemic limitations (P5) that may arise from abstracting the task away from the linguistic domain. That being said, we began by giving the LLM some useful linguistic information (more than an animal might receive) by first prompting the model with information regarding ideal performance parameters (see Figure 1 for full prompt). The language model was then presented with a series of consecutive choice trials, each consisting of two words systematically chosen for transitive neutrality. Seven word stimuli were chosen, and were randomly paired across 10 iterations of this task. Within one iteration, pairs remained consistent and were bound in an ascending order (AB, BC...FG, such that A was always correct and B was always incorrect). After the language model guessed one of two options, it was differentially reinforced with a response of in/correct. Sequential trials were presented in a quasirandom order, in which there could be no more than three consecutive repeats of one trial type. Correct word order within trials was alternated randomly. Piloting showed that the model was able to learn these pairwise discriminations within 3-5 trials, so we presented 10 of each pair for a total of 60 training trials per iteration. |