CRANE: Reasoning with constrained LLM generation
Authors: Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, with up to 10 percentage points of accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-Symbolic and FOLIO. ... In this section, we evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)) and demonstrate significant improvement over both unconstrained and SOTA constrained generation baselines. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Illinois Urbana-Champaign, USA. Correspondence to: Debangshu Banerjee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 CRANE Algorithm |
| Open Source Code | No | The paper does not explicitly state that the code for the methodology is open-source, nor does it provide a direct link to a code repository. It mentions CRANE is implemented using PyTorch and the Hugging Face transformers library, which are third-party libraries, but not the specific code for CRANE. |
| Open Datasets | Yes | We evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)). |
| Dataset Splits | Yes | We further evaluate CRANE on the validation split of FOLIO dataset... We use 2 few-shot examples in the prompt. |
| Hardware Specification | Yes | Experimental Setup. We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVidia RTX A5000 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch (Paszke et al., 2019), the Hugging Face transformers library (Wolf et al., 2020), the Z3 solver (De Moura & Bjørner, 2008), the ITERGEN library (Ugare et al., 2024a), and the SYNCODE framework (Ugare et al., 2024b). However, it provides citations to papers describing these tools/libraries rather than the specific software version numbers required for reproduction (e.g., PyTorch 1.9, Z3 v4.8.10). |
| Experiment Setup | Yes | We run greedy decoding with a maximum new token limit of 600 and prompt the LLMs with the 8-shot examples from GSM-Symbolic... For ITERGEN and CRANE, we enforce syntactic constraints via the context-free grammar provided in Appendix D.5.1 and apply the semantic constraint... For CRANE, we use << and >> for the delimiters S1 and S2, respectively. ... For all approaches and models, we run greedy decoding with a maximum new tokens limit of 800 and use 2 few-shot examples in the prompt. |
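The setup above says CRANE uses `<<` and `>>` as the delimiters S1 and S2 that mark where grammar-constrained decoding is switched on and off. The sketch below illustrates that delimiter mechanic in isolation; it is our toy reconstruction, not the paper's implementation, and both helper names are ours.

```python
import re

def constrained_spans(text: str, s1: str = "<<", s2: str = ">>") -> list[str]:
    """Return the substrings enclosed by the delimiters s1 ... s2.

    In a CRANE-style decoder, only these spans would be generated under
    grammar constraints; the surrounding chain-of-thought text is
    generated unconstrained. (Illustrative helper, not the paper's code.)
    """
    pattern = re.escape(s1) + r"(.*?)" + re.escape(s2)
    return re.findall(pattern, text, flags=re.DOTALL)

def in_constrained_region(prefix: str, s1: str = "<<", s2: str = ">>") -> bool:
    """Whether the current generation prefix is inside an open s1...s2
    region, i.e. the next token should be filtered by the grammar."""
    return prefix.count(s1) > prefix.count(s2)
```

For example, on a GSM-Symbolic-style output such as `"Total is <<2+3*4>> so the answer is <<14>>."`, `constrained_spans` returns the two expressions `"2+3*4"` and `"14"`, and `in_constrained_region("Total is <<2+")` is `True` because a `<<` has been emitted without a matching `>>`.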