CRANE: Reasoning with constrained LLM generation

Authors: Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, with up to 10 percentage points of accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-Symbolic and FOLIO. ... In this section, we evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)) and demonstrate significant improvement over both unconstrained and SOTA constrained generation baselines.
Researcher Affiliation | Academia | Department of Computer Science, University of Illinois Urbana-Champaign, USA. Correspondence to: Debangshu Banerjee <EMAIL>.
Pseudocode | Yes | Algorithm 1: CRANE Algorithm
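The core idea behind Algorithm 1 is to alternate between unconstrained reasoning text and grammar-constrained generation, switching modes at the delimiters S1 and S2. Below is a minimal sketch of that alternation, not the paper's implementation: `propose` and `grammar_mask` are hypothetical stand-ins for the LLM's ranked token proposals and the grammar-based token filter, and each delimiter is assumed to arrive as a single token.

```python
def crane_decode(propose, grammar_mask, max_new_tokens=600, s1="<<", s2=">>"):
    """Sketch of CRANE-style adaptive decoding.

    propose(prefix)          -> candidate tokens in descending model preference
    grammar_mask(expr, tok)  -> True if appending tok keeps the expression
                                inside <<...>> syntactically extensible
    Outside <<...>> spans the model generates freely; inside them, only
    grammar-valid tokens are accepted (the highest-ranked valid one wins).
    """
    out = []
    constrained = False
    for _ in range(max_new_tokens):
        prefix = "".join(out)
        for tok in propose(prefix):
            # In constrained mode, filter candidates through the grammar mask,
            # checking only the partial expression after the last s1 delimiter.
            if not constrained or grammar_mask(prefix.rsplit(s1, 1)[-1], tok):
                out.append(tok)
                break
        else:
            break  # no (valid) continuation proposed: stop decoding
        if out[-1] == s1:
            constrained = True    # enter constrained expression mode
        elif out[-1] == s2:
            constrained = False   # back to unconstrained reasoning
    return "".join(out)
```

In this toy setting, a scripted model that prefers invalid symbols inside an expression is silently steered to its highest-ranked grammar-valid alternative, while its free-form text outside the delimiters is left untouched.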
Open Source Code | No | The paper does not explicitly state that the code for the methodology is open source, nor does it provide a direct link to a code repository. It mentions that CRANE is implemented using PyTorch and Hugging Face Transformers, which are third-party libraries, not CRANE-specific code.
Open Datasets | Yes | We evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)).
Dataset Splits | Yes | We further evaluate CRANE on the validation split of the FOLIO dataset... We use 2 few-shot examples in the prompt.
Hardware Specification | Yes | Experimental Setup: We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVIDIA RTX A5000 GPUs.
Software Dependencies | No | The paper mentions using PyTorch (Paszke et al., 2019), the Hugging Face Transformers library (Wolf et al., 2020), the Z3 solver (De Moura & Bjørner, 2008), the ITERGEN library (Ugare et al., 2024a), and the SYNCODE framework (Ugare et al., 2024b). However, it cites the papers describing these tools rather than the specific software versions required for reproduction (e.g., PyTorch 1.9, Z3 v4.8.10).
Experiment Setup | Yes | We run greedy decoding with a maximum new-token limit of 600 and prompt the LLMs with the 8-shot examples from GSM-Symbolic... For ITERGEN and CRANE, we enforce syntactic constraints via the context-free grammar provided in Appendix D.5.1 and apply the semantic constraint... For CRANE, we use << and >> as the delimiters S1 and S2, respectively. ... For all approaches and models, we run greedy decoding with a maximum new-token limit of 800 and use 2 few-shot examples in the prompt.
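The setup above wraps each intermediate expression in << ... >> and enforces syntactic validity inside the span via a context-free grammar (Appendix D.5.1). As a rough illustration of what "syntactically valid inside the delimiters" means, the sketch below uses Python's `ast` module as a stand-in for the paper's grammar; this is an assumption for illustration only, and the real system filters tokens during decoding rather than validating completed output after the fact.

```python
import ast
import re


def expressions_valid(text, s1="<<", s2=">>"):
    """Check that every <<...>> span in `text` parses as an expression.

    Python's ast.parse in 'eval' mode stands in for the paper's
    context-free grammar from Appendix D.5.1, which this sketch
    does not reproduce.
    """
    pattern = re.escape(s1) + r"(.*?)" + re.escape(s2)
    for expr in re.findall(pattern, text):
        try:
            ast.parse(expr, mode="eval")  # reject syntactically broken spans
        except SyntaxError:
            return False
    return True
```

For example, a completion like `"so the total is <<2+3>> apples"` passes this check, while `"total is <<2+*>>"` fails; constrained decoding prevents the second kind of span from ever being generated.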