ReCogLab: a framework testing relational reasoning & cognitive hypotheses on LLMs

Authors: Andrew Liu, Henry Prior, Gargi Balasubramaniam, Rivka Moroshko, Amir Zait, Ilia Labzovsky, Danny Karmon, Ishita Dasgupta, Kimberly Stachenfeld, Kenneth Marino

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We demonstrate several such experiments and present our findings on a wide variety of open and closed-source language models. We perform a number of experiments and probes to demonstrate how our framework can be used to scale to model performance, probe and analyze model performance, and recreate Cognitive Science experiments on humans and language models using our framework, like ordering and congruency." |
| Researcher Affiliation | Industry | "Google DeepMind, EMAIL" |
| Pseudocode | No | The paper describes the generative framework and its components in detail, including how it constructs training examples and the logic for generating syllogisms and graphs. However, it does not present any of this information in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | "We release all data and code at https://github.com/google-deepmind/recoglab. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab (https://github.com/google-deepmind/recoglab)." |
| Open Datasets | Yes | "We release the ReCogLab framework and datasets used in this manuscript as a useful tool for evaluating and examining LLMs for relational reasoning. For names, we use a dataset of 258k popular baby names (source: https://www.cs.princeton.edu/introcs/data/names.csv). We use this for both Social Network and Comparison-Age." |
| Dataset Splits | Yes | "We generate 50 validation examples and report on 250-1000 test examples. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab. We prepare train-val-test splits on entity names." |
| Hardware Specification | No | The paper lists the language models evaluated (e.g., Google's Gemma and Gemini models, Mixtral, OpenAI's GPT-4o) but does not specify the hardware or computational resources used by the authors to run these evaluations or conduct their experiments. |
| Software Dependencies | Yes | "We use the NetworkX Python library to construct randomly generated graphs. We used the JAX v0.4.33 randomization implementation, since the PRNG key behavior is consistent within a version." |
| Experiment Setup | Yes | "Each specific probe involves validating on 50 examples before selecting the best prompt and parser for test-time evaluation. Please see Appendix A for the library of prompts and parsing strategies. To mitigate this issue, we treat prompt templates and answer parsing as hyper-parameters to fit on a validation set first." |
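The Dataset Splits row quotes that train-val-test splits are prepared on entity names, so that names appearing in one split never leak into another. A minimal sketch of such a name-level split, with an illustrative `split_names` helper that is not from the released code:

```python
import random

def split_names(names, seed=0, frac=(0.8, 0.1)):
    """Partition a list of entity names into disjoint train/val/test splits.

    `frac` gives the train and validation fractions; the remainder is test.
    A fixed seed makes the partition reproducible. (Hypothetical helper,
    not the paper's implementation.)
    """
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * frac[0])
    n_val = int(len(shuffled) * frac[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Because the split is over names rather than examples, every generated example in the test split mentions only held-out entities.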
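The Software Dependencies row notes that NetworkX is used to construct randomly generated graphs, with randomness pinned (via a fixed JAX version's PRNG) so that generated examples are reproducible. A minimal sketch of seeded graph generation with NetworkX alone; `generate_relation_graph` and its parameters are illustrative assumptions, not the framework's API:

```python
import networkx as nx

def generate_relation_graph(seed, num_entities=6, edge_prob=0.5):
    """Sample a random directed graph over `num_entities` nodes.

    Passing an explicit seed makes the sampled edge set deterministic,
    mirroring the paper's emphasis on reproducible example generation.
    """
    return nx.gnp_random_graph(num_entities, edge_prob, seed=seed, directed=True)

g1 = generate_relation_graph(seed=0)
g2 = generate_relation_graph(seed=0)
# Same seed, same graph: edge sets match exactly.
assert sorted(g1.edges()) == sorted(g2.edges())
```

The same principle motivates the quoted JAX version pin: JAX guarantees PRNG key behavior is stable within a version, so regenerating a dataset from a stored seed yields identical examples.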
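The Experiment Setup row quotes a protocol in which prompt templates and answer parsers are treated as hyper-parameters, selected on a small validation set before test-time evaluation. A sketch of that selection loop; `query_model` stands in for any LLM call, and all names here are hypothetical rather than the paper's code:

```python
def select_prompt_and_parser(templates, parsers, val_set, query_model):
    """Grid-search (template, parser) pairs by validation accuracy.

    templates: prompt strings with a `{question}` placeholder.
    parsers: callables mapping a raw model response to an answer string.
    val_set: list of (question, gold_answer) pairs.
    query_model: callable mapping a prompt string to a response string
                 (an assumed stand-in for the actual LLM interface).
    Returns the (template, parser) pair with the highest accuracy.
    """
    best_pair, best_acc = None, -1.0
    for template in templates:
        for parser in parsers:
            correct = 0
            for question, gold in val_set:
                response = query_model(template.format(question=question))
                if parser(response) == gold:
                    correct += 1
            acc = correct / len(val_set)
            if acc > best_acc:
                best_pair, best_acc = (template, parser), acc
    return best_pair
```

Fitting these choices on the 50-example validation set, rather than the test set, avoids inflating test accuracy through prompt or parser tuning.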