ReCogLab: a framework testing relational reasoning & cognitive hypotheses on LLMs
Authors: Andrew Liu, Henry Prior, Gargi Balasubramaniam, Rivka Moroshko, Amir Zait, Ilia Labzovsky, Danny Karmon, Ishita Dasgupta, Kimberly Stachenfeld, Kenneth Marino
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate several such experiments and present our findings on a wide variety of open and closed-source language models. We perform a number of experiments and probes to demonstrate how our framework can be used to scale, probe, and analyze model performance, and to recreate Cognitive Science experiments on humans with language models, such as ordering and congruency. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper describes the generative framework and its components in detail, including how it constructs training examples and the logic for generating syllogisms and graphs. However, it does not present any of this information in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | We release all data and code at https://github.com/google-deepmind/recoglab. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab. |
| Open Datasets | Yes | We release the ReCogLab framework and datasets used in this manuscript as a useful tool for evaluating and examining LLMs for relational reasoning. For names, we use a dataset of 258k popular baby names (source: https://www.cs.princeton.edu/introcs/data/names.csv). We use this for both Social Network and Comparison-Age. |
| Dataset Splits | Yes | We generate 50 validation examples and report on 250-1000 test examples. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab. We prepare train-val-test splits on entity names. |
| Hardware Specification | No | The paper lists the language models evaluated (e.g., Google's Gemma and Gemini models, Mixtral, OpenAI's GPT-4o) but does not specify the hardware or computational resources used by the authors to run these evaluations or conduct their experiments. |
| Software Dependencies | Yes | We use the NetworkX Python library to construct randomly generated graphs. We used the JAX v0.4.33 randomization implementation since the PRNG key behavior is consistent within a version. |
| Experiment Setup | Yes | Each specific probe involves validating on 50 examples before selecting the best prompt and parser for test-time evaluation. Please see Appendix A for the library of prompts and parsing strategies. To mitigate this issue, we treat prompt templates and answer parsing as hyper-parameters to fit on a validation set first. |
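To illustrate the kind of seeded, generative relational-reasoning example the framework produces (the paper uses NetworkX graphs and JAX's version-pinned PRNG for reproducibility), here is a minimal stdlib-only sketch of a Comparison-Age-style probe. The function name and example structure are hypothetical, not ReCogLab's actual API:

```python
import random

def generate_comparison_example(names, num_entities=4, seed=0):
    """Sketch of a Comparison-Age-style probe (hypothetical helper).

    Samples entities, fixes a hidden age ordering, and emits adjacent
    pairwise statements; the question probes the transitive closure.
    """
    rng = random.Random(seed)  # fixed seed => reproducible examples
    order = rng.sample(names, num_entities)  # hidden ordering, oldest first
    statements = [f"{a} is older than {b}." for a, b in zip(order, order[1:])]
    question = f"Who is older, {order[0]} or {order[-1]}?"
    return {
        "context": " ".join(statements),
        "question": question,
        "answer": order[0],  # oldest entity wins the transitive comparison
    }

example = generate_comparison_example(["Ada", "Ben", "Cleo", "Dev", "Eli"], seed=7)
```

In the real framework, the entity ordering comes from a randomly generated NetworkX graph, and validation/test examples are kept reproducible by pinning the JAX PRNG version.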