Discovering Symbolic Cognitive Models from Human and Animal Behavior
Authors: Pablo Samuel Castro, Nenad Tomasev, Ankit Anand, Navodita Sharma, Rishika Mohanta, Aparna Dev, Kuba Perlin, Siddhant Jain, Kyle Levin, Noemi Elteto, Will Dabney, Alexander Novikov, Glenn C Turner, Maria K Eckstein, Nathaniel D. Daw, Kevin J Miller, Kim Stachenfeld
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each. The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Figure 1. Discovered models outperform human-designed models. We evaluate the best program discovered by CogFunSearch for each dataset, using average normalized likelihood of the choices made by held-out test subjects, and compare it to the best existing model from the neuroscience and psychology literature (all p < 0.002, signed-rank test). |
| Researcher Affiliation | Collaboration | 1. Google DeepMind; 2. Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA |
| Pseudocode | Yes | The paper includes structured code blocks in Appendix E ('Baseline Programs'), Appendix G ('Best Discovered Programs'), and Appendix H ('Seed program comparison'), such as: 'def agent( params: chex.Array, choice: int, reward: int, agent_state: Optional[chex.Array], ) -> Tuple[chex.Array, chex.Array]:' in section E.1. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of their source code or a link to a repository for the methodology described in this paper. It mentions using FunSearch and Python community tools, but not the specific code implemented for this work. |
| Open Datasets | Yes | Human Dataset (Fig. 3A; Eckstein et al. 2024) considers human participants performing a four-alternative task with graded rewards. Rat Dataset (Fig. 3B; Miller et al. 2021) considers rats performing a two-armed bandit task with binary rewards. Fruit Fly Dataset (Fig. 3C; Mohanta 2022; Rajagopalan et al. 2023) considers fruit flies performing a two-armed bandit task with binary rewards. |
| Dataset Splits | Yes | In particular, for each subject i, we split its sessions into even and odd sets d_{i,even} := {s_{i,0}, s_{i,2}, ..., s_{i,M-1}} and d_{i,odd} := {s_{i,1}, s_{i,3}, ..., s_{i,M}}, respectively. For the fruit fly dataset, since we have only one session per subject, we forego this additional level of variation and treat the dataset as though it were multiple sessions from a single subject with a single θ. We maintain a group of held-out subjects, D_test, in order to validate our discovered programs. Figure 9. Organizing data for train and test. In Human and Rat datasets we use half of the subjects for training, and half for testing; for each train subject, we use half of its sessions for parameter fitting and half for evaluation. For the Fly dataset we use half the subjects for training and half for testing, and proceed similarly as for the other datasets. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments. It mentions using LLMs like Gemini 1.5 Flash, which implies high-performance computing, but no details on CPU, GPU models, or other hardware specifications are provided. |
| Software Dependencies | Yes | The authors would also like to thank the Python community (Van Rossum & Drake Jr, 1995; Oliphant, 2007) for developing tools that enabled this work, including NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007), Jupyter (Kluyver et al., 2016), Pandas (McKinney, 2013) and JAX (Bradbury et al., 2018b). CogFunSearch's programs must be implemented in JAX (Bradbury et al., 2018a) so that they are differentiable. |
| Experiment Setup | Yes | We use the AdaBelief optimizer with learning rate 5×10⁻², which is run until convergence or until 10,000 steps of gradient descent are reached. In order to test convergence, we compare the current score at iteration k, Ω_k, to the previously recorded score Ω_{k−100} every 100 steps. If the relative change in score \|(Ω_k − Ω_{k−100})/Ω_{k−100}\| is less than a convergence threshold of 10⁻², we conclude that parameter fitting has converged. Specifically, we train a GRU model (Cho et al., 2014) over d_{i,even}, run a sweep on the number of hidden units (over {1, 2, 4, 8, 16, 32, 64, 128}), and use early-stopping to select the best parameters. All the variants were trained with the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e−4. |
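The convergence rule quoted in the Experiment Setup row (check every 100 steps, stop when the relative score change drops below 10⁻² or after 10,000 steps) can be sketched as a small driver loop. This is a minimal illustration, not the authors' code: the `step` and `score` callables are hypothetical stand-ins for the actual AdaBelief update and loss evaluation over the JAX-implemented programs.

```python
def fit_until_converged(step, score, max_steps=10_000,
                        check_every=100, tol=1e-2):
    """Run `step` until the relative change in `score` over each
    `check_every`-step window falls below `tol`, or until `max_steps`
    is reached. Mirrors the convergence test described in the paper;
    `step`/`score` are hypothetical placeholders for the optimizer
    update and the fitting objective."""
    prev = None
    for k in range(1, max_steps + 1):
        step()
        if k % check_every == 0:
            cur = score()
            # Relative change |(Ω_k − Ω_{k−100}) / Ω_{k−100}| < 10⁻²
            if prev is not None and abs((cur - prev) / prev) < tol:
                return k  # converged
            prev = cur
    return max_steps  # hit the step budget without converging


# Toy usage: plain gradient descent on f(x) = x², offset so the
# score stays bounded away from zero in the relative-change test.
state = {"x": 10.0}

def toy_step():
    state["x"] -= 0.01 * 2 * state["x"]

def toy_score():
    return state["x"] ** 2 + 1.0

steps_taken = fit_until_converged(toy_step, toy_score)
```

The windowed relative-change criterion is scale-free, so the same threshold works whether the likelihood score is large or small, which matters when the same fitting loop is reused across datasets with different numbers of trials.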