Can Transformers Reason Logically? A Study in SAT Solving

Authors: Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces ("reasoning paths") from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result.
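The "backtracking and deduction" that the constructed Transformer's CoT trace emulates is essentially a DPLL-style decision procedure. A minimal illustrative sketch (not the authors' implementation; function names and the recursion scheme are assumptions) might look like:

```python
# Minimal DPLL-style 3-SAT decision procedure: unit propagation (deduction)
# plus branching with backtracking, the two operations a CoT trace can emulate.
# Illustrative sketch only, not the paper's PARAT construction.

def dpll(clauses, assignment=None):
    """clauses: list of clauses, each a list of nonzero ints (DIMACS-style
    literals, negative = negated variable). Returns True iff satisfiable."""
    if assignment is None:
        assignment = {}

    # Simplify every clause under the current partial assignment.
    simplified = []
    for clause in clauses:
        new_clause = []
        satisfied = False
        for lit in clause:
            val = assignment.get(abs(lit))
            if val is None:
                new_clause.append(lit)       # literal still undecided
            elif (lit > 0) == val:
                satisfied = True             # clause already true
                break
        if satisfied:
            continue
        if not new_clause:                   # empty clause: conflict
            return False
        simplified.append(new_clause)

    if not simplified:                       # every clause satisfied
        return True

    # Deduction step: a unit clause forces its literal's value.
    for clause in simplified:
        if len(clause) == 1:
            lit = clause[0]
            return dpll(simplified, {**assignment, abs(lit): lit > 0})

    # Backtracking step: branch on the first unassigned variable.
    var = abs(simplified[0][0])
    return (dpll(simplified, {**assignment, var: True})
            or dpll(simplified, {**assignment, var: False}))
```

For example, `dpll([[1, 2], [-1, 2], [-2]])` reports UNSAT, since the unit clause forces variable 2 false, which then forces a contradiction on variable 1.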
Researcher Affiliation | Collaboration | ¹College of Computing, Georgia Institute of Technology, Atlanta, GA, USA; ²Google Research, Atlanta, USA.
Pseudocode | Yes | Algorithm 1 Greedy Decoding
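The paper's Algorithm 1 is greedy decoding, i.e., repeatedly appending the argmax next-token prediction. A generic sketch, where `next_token_logits` stands in for a forward pass of the Transformer and the stop token is a placeholder, could be:

```python
# Greedy decoding sketch: at each step, run the model once and append the
# highest-scoring token until the stop token appears or a length cap is hit.
# `next_token_logits` is a stand-in for the real Transformer forward pass.

def greedy_decode(next_token_logits, prompt, eos_token, max_new_tokens=850):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                      # one forward pass
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax token id
        tokens.append(best)
        if best == eos_token:
            break
    return tokens
```

The cap of 850 mirrors the models' context size reported in the experiment setup, but the exact stopping rule in the paper's algorithm may differ.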
Open Source Code | Yes | With PARAT, we implemented the construction as a PyTorch Transformer model and empirically validated its correctness on random 3-SAT instances. ... We present our full code implementing our construction for Theorem 4.5 using PARAT in Appendix D.4.
Open Datasets | No | We evaluate our models on randomly sampled DIMACS encodings of 3-SAT formulas. ... We generate the above 3 datasets for each variable number 4 ≤ p ≤ 20, resulting in 51 total datasets of 2000 samples each.
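For readers unfamiliar with the format, a random 3-SAT instance in DIMACS CNF can be sampled with a short sketch like the one below; the clause-sampling scheme is illustrative and not necessarily the paper's exact generator:

```python
import random

# Sketch: sample a random 3-SAT formula and render it in DIMACS CNF.
# Each clause picks 3 distinct variables and negates each with probability 1/2.
# Illustrative only; the paper's sampling scheme may differ.

def random_3sat_dimacs(num_vars, num_clauses, rng=random):
    lines = [f"p cnf {num_vars} {num_clauses}"]        # DIMACS header
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        lines.append(" ".join(map(str, clause)) + " 0")  # 0 terminates a clause
    return "\n".join(lines)
```

Each output line after the `p cnf` header lists three literals followed by a terminating `0`, which is the standard DIMACS clause delimiter.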
Dataset Splits | Yes | For training data, we use the same sampling methods, but instead of having a separate dataset for each variable number p, we pick 2 ranges p ∈ [6, 10] and p ∈ [11, 15], where for each sample a p value is picked uniformly at random from the range. Each formula with p variables contains 16.4p to 17.6p tokens. This results in 2 × 3 training datasets, each containing 5 × 10⁵ training samples, with balanced SAT vs. UNSAT samples. For each formula, we generate the corresponding CoT in the same format as Figure 1 using a custom SAT solver. The evaluation data is exactly the same as in Section 5.1.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or cloud instance types) are provided in the paper; the text only describes model architecture and training parameters.
Software Dependencies | No | The paper mentions 'PyTorch', the 'Hugging Face library', and 'NumPy' in the context of the PARAT tool and model implementation, but specific version numbers for these software components are not provided.
Experiment Setup | Yes | We use Llama (Touvron et al., 2023) models in the Hugging Face library. For the 70M model, we use models with 6 layers, 512 embedding dimensions, 8 heads, 512 attention hidden dimensions, and 2048 MLP hidden dimensions. For the 140M model, we use 12 layers, 768 embedding dimensions, 12 heads, 768 attention hidden dimensions, and 3072 MLP hidden dimensions. Both models have a context size of 850. We trained for 5 epochs on both datasets using the Adam optimizer with a scheduled cosine learning rate decaying from 6 × 10⁻⁴ to 6 × 10⁻⁵, with β₁ = 0.9 and β₂ = 0.95.
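The scheduled cosine decay from 6 × 10⁻⁴ to 6 × 10⁻⁵ can be sketched as a simple schedule function; the total-step count here is an illustrative assumption, and the paper may additionally use warmup, which is omitted:

```python
import math

# Cosine learning-rate schedule decaying from lr_max = 6e-4 to lr_min = 6e-5,
# matching the endpoints described in the experiment setup. Sketch only:
# total_steps is an assumed value and any warmup phase is omitted.

def cosine_lr(step, total_steps, lr_max=6e-4, lr_min=6e-5):
    progress = min(step, total_steps) / total_steps            # in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At step 0 this returns lr_max and at the final step lr_min, decreasing smoothly in between.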