Flexible and Efficient Grammar-Constrained Decoding

Authors: Kanghee Park, Timothy Zhou, Loris D’Antoni

ICML 2025

Reproducibility assessment. For each variable, the result and the LLM response supporting it are listed below.
Research Type: Experimental
LLM Response: 5. Experiments. We implemented our algorithms for computing token masks in a tool called GREATGRAMMA. A testament to the simplicity of our approach is that GREATGRAMMA is implemented in just 900 lines of Python code built on top of the LALR parser provided by the LARK library (Shinan et al.). Specifically, we used LARK to parse regular expressions and context-free grammars, and used LARK's LALR parser to generate a parse table representing a deterministic PDA for parsing sequences of tokens in a given context-free grammar. In this section, we evaluate GREATGRAMMA by answering the following two questions: RQ1: How does the offline preprocessing overhead of GREATGRAMMA compare to existing GCD approaches? RQ2: How does the online per-token decoding overhead of GREATGRAMMA compare to existing GCD approaches? ... Tab. 1 reports the results.
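The row above quotes the paper's description of compiling an LALR parse table that acts as a deterministic PDA over token sequences. As a hedged illustration of that idea only (not GREATGRAMMA's code, and far simpler than a real LALR table), the sketch below simulates a deterministic PDA for a toy grammar of balanced parentheses:

```python
# Illustrative only: a deterministic PDA for balanced parentheses,
# standing in for the stack machine an LALR parse table encodes.
# This is NOT the paper's implementation.
def pda_accepts(s: str) -> bool:
    """Accept iff `s` is a balanced string over '(' and ')'."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)      # push: analogous to a shift onto the parse stack
        elif ch == ")":
            if not stack:
                return False      # no matching '(' on the stack
            stack.pop()           # pop: analogous to a reduce
        else:
            return False          # symbol outside the grammar's alphabet
    return not stack              # accept only if the stack is empty

print(pda_accepts("(())()"))  # True
print(pda_accepts("(()"))     # False
```

A real LALR table adds states and lookahead on top of this stack discipline, but the accept/reject mechanics per symbol are the same shape.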
Researcher Affiliation: Academia
LLM Response: Department of Computer Science and Engineering, UCSD, San Diego, USA. Correspondence to: Kanghee Park <EMAIL>. Loris D'Antoni holds concurrent appointments as a Professor at the University of California San Diego and as an Amazon Scholar. This paper describes work performed at the University of California San Diego and is not associated with Amazon.
Pseudocode: Yes
LLM Response: A. Algorithms. In this section we formalize several algorithms presented in the paper. Alg. 1 describes the abstract algorithm for general constrained decoding. Alg. 2 describes how to build the lexing transducer from a lexing automaton given as an FSA. Alg. 3 describes the construction of the detokenizing transducer, which converts sequences of tokens to sequences of the characters they contain. Alg. 5 describes the parser preprocessing and how the always-accepted token table and context-dependent sequence table are built. (Algorithm 1, CONSTRAINEDDECODING, is presented as a block.)
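Since the response above summarizes Alg. 1 (CONSTRAINEDDECODING) only in prose, the following hedged sketch shows the generic shape of such a loop: at each step a grammar-derived mask rules out invalid tokens before the highest-scoring allowed token is picked. Every name here (constrained_decode, valid_mask_fn, the toy vocabulary) is illustrative, not the paper's API:

```python
# Hedged sketch of a generic grammar-constrained decoding loop, in the
# spirit of Alg. 1; all names below are illustrative, not the paper's API.
import math

def constrained_decode(logits_fn, valid_mask_fn, vocab, max_steps=10, eos="<eos>"):
    out = []
    for _ in range(max_steps):
        scores = logits_fn(out)            # model scores for each vocab token
        mask = valid_mask_fn(out)          # True = grammar allows this token next
        best, best_score = None, -math.inf
        for tok, score in zip(vocab, scores):
            if mask[tok] and score > best_score:
                best, best_score = tok, score   # greedy pick among allowed tokens
        if best is None or best == eos:
            break
        out.append(best)
    return out

# Toy setup: a "model" that always prefers ")" and a mask enforcing balanced
# parentheses, so the constraint overrides the model whenever needed.
vocab = ["(", ")", "<eos>"]
logits_fn = lambda prefix: [1.0, 2.0, 0.5]

def valid_mask_fn(prefix):
    depth = prefix.count("(") - prefix.count(")")
    return {"(": True, ")": depth > 0, "<eos>": depth == 0}

print("".join(constrained_decode(logits_fn, valid_mask_fn, vocab)))  # ()()()()()
```

The point the paper's algorithms address is computing that mask efficiently over a subword vocabulary; the loop itself is the easy part.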
Open Source Code: No
LLM Response: We implemented our algorithms for computing token masks in a tool called GREATGRAMMA. A testament to the simplicity of our approach is that GREATGRAMMA is implemented in just 900 lines of Python code built on top of the LALR parser provided by the LARK library (Shinan et al.). The paper does not provide a link to a repository or explicitly state that the code is publicly available.
Open Datasets: Yes
LLM Response: Models and Grammars. We conduct our experiments using three different tokenizers: Llama (32K tokens), Llama-3 (128K), and Qwen-2 (151K). We consider simplified Go, Java, and Python grammars from prior work on GCD (Ugare et al., 2024). The evaluation provided on LLGUIDANCE's website reports impressive performance (on the order of microseconds to milliseconds for both offline pre-processing and online per-token overhead) on benchmarks like JSONSchema Bench (Geng et al., 2025).
Dataset Splits: No
LLM Response: Measures. For each combination of tokenizer and grammar, we measured the time taken to preprocess the grammar for that tokenizer and the average time taken by each GCD implementation to produce a token at inference time. To fairly evaluate per-token computation time, we wanted to ensure that all three tools followed the same token selection process and recorded the time required to compute the mask at each step, i.e., the online overhead. We manually built 5 programs in each grammar and used them to decide what sequence of tokens the decoder was going to produce, i.e., for each example program, the decoder produced each token in the program in order and computed the token masks at each step. This describes how data was used for evaluation, but not traditional train/test/validation splits.
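The measurement procedure quoted above (replaying a fixed token sequence and timing only the mask computation at each step) can be sketched as follows; compute_mask here is a hypothetical stand-in for any GCD implementation's mask routine, not the paper's tooling:

```python
# Hedged sketch of the per-token online-overhead measurement described above:
# replay a predetermined token sequence and time only the mask computation.
# `compute_mask` is a hypothetical stand-in, not the paper's tooling.
import time

def average_online_overhead(token_sequence, compute_mask):
    prefix, timings = [], []
    for tok in token_sequence:
        start = time.perf_counter()
        compute_mask(prefix)                   # the online, per-token work
        timings.append(time.perf_counter() - start)
        prefix.append(tok)                     # decoder emits the known token
    return sum(timings) / len(timings)         # average per-token overhead

# Toy usage with a dummy mask function that allows every token.
avg = average_online_overhead(["a", "b", "c"], lambda prefix: set("abc"))
print(f"{avg:.2e} s per token")
```

Fixing the token sequence in advance removes sampling variance, so the timing differences between tools reflect only their mask-computation cost.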
Hardware Specification: No
LLM Response: Models and Grammars. We conduct our experiments using three different tokenizers: Llama (32K tokens), Llama-3 (128K), and Qwen-2 (151K). No specific hardware (GPU, CPU models, etc.) used for running the experiments is mentioned.
Software Dependencies: No
LLM Response: We implemented our algorithms for computing token masks in a tool called GREATGRAMMA. A testament to the simplicity of our approach is that GREATGRAMMA is implemented in just 900 lines of Python code built on top of the LALR parser provided by the LARK library (Shinan et al.). While Python and LARK are mentioned, no specific version numbers are provided for these software components.
Experiment Setup: No
LLM Response: Measures. For each combination of tokenizer and grammar, we measured the time taken to preprocess the grammar for that tokenizer and the average time taken by each GCD implementation to produce a token at inference time. To fairly evaluate per-token computation time, we wanted to ensure that all three tools followed the same token selection process and recorded the time required to compute the mask at each step, i.e., the online overhead. We manually built 5 programs in each grammar and used them to decide what sequence of tokens the decoder was going to produce, i.e., for each example program, the decoder produced each token in the program in order and computed the token masks at each step. This section describes the experimental evaluation process but does not provide specific hyperparameters or model training settings.