ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our results reveal a significant decline in accuracy as problem complexity grows, a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Through extensive evaluation of various LLMs across diverse architectures and sizes, we observe a dramatic decline in performance as puzzle complexity increases, a phenomenon we term the curse of complexity for reasoning. Most models struggle once the puzzle's search space exceeds 10^7 possibilities (e.g., for puzzles with 4x5 grid size) or when the number of logical conflicts in a widely used SMT solver named Z3 (de Moura & Bjørner, 2008) surpasses 20.
Researcher Affiliation Collaboration 1 University of Washington, 2 Allen Institute for AI, 3 Stanford University
Pseudocode Yes Algorithm 1 ZebraLogic Puzzle Generation.
Require: A set of possible attributes A_all and their value sets V_a for each a ∈ A_all
Require: Clue types C = {c_1, ..., c_L} with templates T(c) for each c ∈ C
Require: Number of houses N, number of attributes M
1: Sample M attributes from A_all to form A = {a_1, ..., a_M}
2: Initialize solution S : H × A → ∪_{a∈A} V_a randomly
3: C ← ClueGeneration(S)  // Initialize clue set
4: while C ≠ ∅ do
5:   p ← SampleClue(C)  // Sample a clue to remove
6:   C′ ← C \ {p}
7:   if |Solutions(C′)| = 1 then
8:     C ← C′  // Remove until S is the unique solution
9:   else
10:    break
11:  end if
12: end while
13: return (S, C)  // Return solution and minimal clue set
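The generation loop of Algorithm 1 can be sketched in Python. This is a minimal toy reconstruction, not the authors' code: count_solutions brute-forces |Solutions(C)| (feasible only for tiny grids), and the positional clue family in toy_clues stands in for the paper's richer clue templates.

```python
import itertools
import random

def count_solutions(clues, values):
    """Brute-force |Solutions(C)|: enumerate every way of permuting each
    attribute's values across the houses and count assignments that
    satisfy all clues. Only feasible for tiny grids."""
    attrs = sorted(values)
    n = 0
    for perms in itertools.product(
            *(itertools.permutations(values[a]) for a in attrs)):
        candidate = dict(zip(attrs, perms))
        if all(clue(candidate) for clue in clues):
            n += 1
    return n

def toy_clues(solution):
    """Toy clue family: one positional clue per filled cell
    ('value v is in house i'). The paper uses richer templates."""
    return [
        (lambda cand, a=a, i=i, v=v: cand[a][i] == v)
        for a, row in solution.items()
        for i, v in enumerate(row)
    ]

def generate_puzzle(values, rng):
    """Sketch of Algorithm 1: sample a random solution grid, over-generate
    clues, then drop clues while the solution remains unique."""
    solution = {a: tuple(rng.sample(vals, len(vals)))
                for a, vals in values.items()}
    clues = toy_clues(solution)
    rng.shuffle(clues)                       # randomize removal order
    while clues:
        reduced = clues[:-1]                 # try removing a sampled clue
        if count_solutions(reduced, values) == 1:
            clues = reduced                  # removal keeps S unique
        else:
            break                            # that clue is necessary; stop
    return solution, clues

rng = random.Random(0)
values = {"color": ["red", "blue", "green"], "pet": ["cat", "dog", "fish"]}
solution, clues = generate_puzzle(values, rng)
print(len(clues), "clues pin the unique solution", solution)
```

By construction the returned clue set still admits exactly one solution: the full positional clue set pins every cell, and each removal is only accepted after re-verifying uniqueness.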
Open Source Code No The text does not contain an unambiguous statement of code release or a direct link to a source-code repository for the methodology described in the paper. The URL https://hf.co/spaces/WildEval/ZebraLogic is for a Hugging Face Space, which is considered a project demonstration page and not a specific code repository.
Open Datasets Yes We create the ZebraLogic dataset, a benchmark of 1,000 logic grid puzzles spanning multiple complexity levels, designed to evaluate LLMs' logical reasoning capabilities systematically with two complexity metrics: search space size and Z3 conflict count (§2). https://hf.co/spaces/WildEval/ZebraLogic
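The quoted search-space metric can be made concrete. Assuming the search space of an N-house, M-attribute grid is (N!)^M — each attribute's N values can be assigned to the N houses in N! ways, independently per attribute — a small helper reproduces the quoted order of magnitude. This formulation is inferred from the grid structure, not quoted from the paper:

```python
from math import factorial

def search_space(n_houses: int, n_attrs: int) -> int:
    """(N!)^M candidate grids: each attribute independently permutes
    its N values across the N houses."""
    return factorial(n_houses) ** n_attrs

# Under this assumption, 4 houses x 5 attributes gives ~8e6 candidates
# and 5 houses x 4 attributes gives ~2e8, bracketing the 10^7 threshold.
for n, m in [(2, 2), (3, 3), (4, 5), (5, 4), (6, 6)]:
    print(f"{n} houses x {m} attributes: {search_space(n, m):.3e}")
```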
Dataset Splits No The paper describes how the dataset is categorized into complexity groups (Small, Medium, Large, X-Large) based on search space size, but it does not specify explicit training, validation, or test dataset splits for model evaluation or reproduction of experiments. The evaluation is stated as a 'one-shot in-context learning setting' over 1,000 puzzles without further partitioning details.
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies No The paper mentions the use of 'Z3 (de Moura & Bjørner, 2008)' and a 'SAT solver' but does not provide specific version numbers for these or any other software components used in the experimental setup.
Experiment Setup Yes Our evaluation is done in a one-shot in-context learning setting: we provide the models with a single example of how to solve a ZebraLogic puzzle, present its solution in JSON format, and instruct the LLMs to output their reasoning and solution in the same format, making it easier to parse and evaluate their answers. All evaluated models are prompted in the same way (see Appendix D.1), and we use the same greedy decoding, prompts, and parsing script across all models to ensure a fair comparison, except for O1, which does not support greedy decoding, so we run it three times and take the best result.
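The JSON answer format makes scoring mechanical. A minimal illustrative parser and scorer — hypothetical helper names, not the authors' shared parsing script, whose details the text does not publish — might look like:

```python
import json
import re

def extract_json(response: str):
    """Pull a JSON object out of a model response. Greedy match from the
    first '{' to the last '}' is a deliberately simple heuristic."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def score(pred, gold):
    """Return (puzzle_acc, cell_acc): exact match over the whole grid,
    and the fraction of individual cells answered correctly."""
    cells = [(h, a) for h, row in gold.items() for a in row]
    if pred is None:
        return 0.0, 0.0
    correct = sum(
        isinstance(pred.get(h), dict) and pred[h].get(a) == gold[h][a]
        for h, a in cells
    )
    return float(correct == len(cells)), correct / len(cells)

# Hypothetical example: a response that gets 3 of 4 cells right.
gold = {"House 1": {"Name": "Alice", "Drink": "tea"},
        "House 2": {"Name": "Bob", "Drink": "milk"}}
response = ('Step-by-step reasoning ... final solution: '
            '{"House 1": {"Name": "Alice", "Drink": "tea"}, '
            '"House 2": {"Name": "Bob", "Drink": "coffee"}}')
puzzle_acc, cell_acc = score(extract_json(response), gold)
print(puzzle_acc, cell_acc)
```

Distinguishing puzzle-level from cell-level accuracy matters here, since partial grids become common as complexity grows.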