Can Transformers Reason Logically? A Study in SAT Solving
Authors: Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces ("reasoning paths") from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result. |
| Researcher Affiliation | Collaboration | ¹College of Computing, Georgia Institute of Technology, Atlanta, GA, USA. ²Google Research, Atlanta, USA. |
| Pseudocode | Yes | Algorithm 1 Greedy Decoding |
| Open Source Code | Yes | With PARAT, we implemented the construction as a PyTorch Transformer model and empirically validated its correctness on random 3-SAT instances. ... We present our full code implementing our construction for Theorem 4.5 using PARAT in Appendix D.4. |
| Open Datasets | No | We evaluate our models on randomly sampled DIMACS encoding of 3-SAT formulas. ... We generate the above 3 datasets for each variable number 4 ≤ p ≤ 20, resulting in 51 total datasets of 2000 samples each. |
| Dataset Splits | Yes | For training data, we use the same sampling methods, but instead of having a separate dataset for each variable number p, we pick 2 ranges p ∈ [6, 10] and p ∈ [11, 15], where for each sample a random p value is picked uniformly at random from the range. Each formula with p variables contains 16.4p to 17.6p tokens. This results in 2 × 3 training datasets, each containing 5 × 10⁵ training samples, with balanced SAT vs UNSAT samples. For each formula, we generate the corresponding CoT in the same format as Figure 1 using a custom SAT Solver. The evaluation data is exactly the same as Section 5.1. |
| Hardware Specification | No | No specific hardware details (like GPU or CPU models, or cloud instance types) are provided in the paper. The text only describes model architecture and training parameters. |
| Software Dependencies | No | The paper mentions 'PyTorch', the 'Hugging Face library', and 'NumPy' in the context of the PARAT tool and model implementation, but specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | We use Llama (Touvron et al., 2023) models in the Hugging Face library. For the 70M model, we use models with 6 layers, 512 embedding dimensions, 8 heads, 512 attention hidden dimensions, and 2048 MLP hidden dimensions. For the 140M model, we use 12 layers, 768 embedding dimensions, 12 heads, 768 attention hidden dimensions, and 3072 MLP hidden dimensions. Both models have a context size of 850. We trained for 5 epochs on both datasets using the Adam optimizer with a scheduled cosine learning rate decaying from 6 × 10⁻⁴ to 6 × 10⁻⁵ with β1 = 0.9 and β2 = 0.95. |
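The pseudocode noted above is Algorithm 1 (Greedy Decoding). A minimal sketch of greedy (argmax) autoregressive decoding is below; the `score_fn` stands in for the Transformer's next-token scores and the toy scorer is purely illustrative, not the paper's model:

```python
def greedy_decode(score_fn, prompt, eos_token, max_new_tokens):
    """Greedy (argmax) autoregressive decoding.

    score_fn(tokens) -> dict mapping each candidate next token to a score.
    Repeatedly appends the highest-scoring token until the EOS token is
    produced or the new-token budget is exhausted.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = score_fn(tokens)
        next_tok = max(scores, key=scores.get)  # argmax over the vocabulary
        tokens.append(next_tok)
        if next_tok == eos_token:
            break
    return tokens


# Toy scorer (hypothetical): prefers the token one greater than the last,
# capping at 3, which we treat as EOS.
def toy_scores(tokens):
    target = min(tokens[-1] + 1, 3)
    return {t: -abs(t - target) for t in range(4)}
```

With the toy scorer, `greedy_decode(toy_scores, [0], 3, 10)` walks 0 → 1 → 2 → 3 and stops at EOS.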
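The evaluation data consists of randomly sampled DIMACS encodings of 3-SAT formulas. A minimal sketch of uniform random 3-SAT generation in DIMACS CNF format follows; the clause-sampling distribution here is a standard uniform model for illustration only and is not taken from the paper:

```python
import random


def random_3sat_dimacs(num_vars, num_clauses, seed=None):
    """Sample a random 3-SAT formula and render it as DIMACS CNF text.

    Each clause draws 3 distinct variables and negates each literal with
    probability 1/2 (an illustrative distribution, not the paper's).
    """
    rng = random.Random(seed)
    lines = [f"p cnf {num_vars} {num_clauses}"]  # DIMACS problem line
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)
        literals = [v if rng.random() < 0.5 else -v for v in variables]
        lines.append(" ".join(map(str, literals)) + " 0")  # 0 terminates a clause
    return "\n".join(lines)
```

For example, `random_3sat_dimacs(10, 42)` produces a `p cnf 10 42` header followed by 42 zero-terminated clauses of 3 distinct variables each.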
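The training setup specifies a cosine learning-rate schedule decaying from 6 × 10⁻⁴ to 6 × 10⁻⁵. A sketch of the standard cosine decay formula is below; whether the paper used a warmup phase is not stated in the excerpt, so none is shown:

```python
import math


def cosine_lr(step, total_steps, lr_max=6e-4, lr_min=6e-5):
    """Cosine learning-rate decay from lr_max (step 0) to lr_min (final step)."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

The schedule starts at 6e-4, reaches the midpoint value of 3.3e-4 halfway through training, and ends at 6e-5.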