ReCogLab: a framework testing relational reasoning & cognitive hypotheses on LLMs
Authors: Andrew Liu, Henry Prior, Gargi Balasubramaniam, Rivka Moroshko, Amir Zait, Ilia Labzovsky, Danny Karmon, Ishita Dasgupta, Kimberly Stachenfeld, Kenneth Marino
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate several such experiments and present our findings on a wide variety of open and closed-source language models. We perform a number of experiments and probes to demonstrate how our framework can be used to scale, probe, and analyze model performance, and to recreate Cognitive Science experiments on humans with language models, such as ordering and congruency. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper describes the generative framework and its components in detail, including how it constructs training examples and the logic for generating syllogisms and graphs. However, it does not present any of this information in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | We release all data and code at https://github.com/google-deepmind/recoglab. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab. |
| Open Datasets | Yes | We release the ReCogLab framework and datasets used in this manuscript as a useful tool for evaluating and examining LLMs for relational reasoning. For names, we use a dataset of 258k popular baby names (source: https://www.cs.princeton.edu/introcs/data/names.csv). We use this for both Social Network and Comparison-Age. |
| Dataset Splits | Yes | We generate 50 validation examples and report on 250-1000 test examples. We provide technical details in Appendix B and open-source code for generating examples from ReCogLab. We prepare train-val-test splits on entity names. |
| Hardware Specification | No | The paper lists the language models evaluated (e.g., Google's Gemma and Gemini models, Mixtral, OpenAI's GPT-4o) but does not specify the hardware or computational resources used by the authors to run these evaluations or conduct their experiments. |
| Software Dependencies | Yes | We use the NetworkX Python library to construct randomly generated graphs. We used the JAX v0.4.33 randomization implementation since the PRNG key behavior is consistent within a version. |
| Experiment Setup | Yes | Each specific probe involves validating on 50 examples before selecting the best prompt and parser for test-time evaluation. Please see Appendix A for the library of prompts and parsing strategies. To mitigate this issue, we treat prompt templates and answer parsing as hyper-parameters to fit on a validation set first. |
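To illustrate the kind of seeded, generative relational-reasoning example the framework produces (the paper uses NetworkX graphs and JAX's version-pinned PRNG for reproducibility), here is a minimal stdlib-only sketch of a Comparison-Age-style probe. The function name and example structure are hypothetical, not ReCogLab's actual API:

```python
import random

def generate_comparison_example(names, num_entities=4, seed=0):
    """Sketch of a Comparison-Age-style probe (hypothetical helper).

    Samples entities, fixes a hidden age ordering, and emits adjacent
    pairwise statements; the question probes the transitive closure.
    """
    rng = random.Random(seed)  # fixed seed => reproducible examples
    order = rng.sample(names, num_entities)  # hidden ordering, oldest first
    statements = [f"{a} is older than {b}." for a, b in zip(order, order[1:])]
    question = f"Who is older, {order[0]} or {order[-1]}?"
    return {
        "context": " ".join(statements),
        "question": question,
        "answer": order[0],  # oldest entity wins the transitive comparison
    }

example = generate_comparison_example(["Ada", "Ben", "Cleo", "Dev", "Eli"], seed=7)
```

In the real framework, the entity ordering comes from a randomly generated NetworkX graph, and validation/test examples are kept reproducible by pinning the JAX PRNG version.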