Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only consistently maintains high-precision compliant outputs but also achieves significant improvements in inference speed, up to 2x compared to state-of-the-art implementations.
Researcher Affiliation | Academia | 1Department of Computer Science, Rice University, Texas, United States; 2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. Correspondence to: Shiwen Ni <EMAIL>.
Pseudocode | No | The paper describes its algorithms conceptually, such as the Earley algorithm and its operations (Prediction, Scanning, Completion), but does not provide a clearly labeled pseudocode block or algorithm figure with structured steps.
Open Source Code | Yes | We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
Open Datasets | Yes | Test Task: Geoquery (Davis & Meltzer, 2007) transformation converts natural language queries into FunQL, adhering to fixed predicates and finite entity constraints; JSON Schema (Pezoa et al., 2016) generation produces JSON instances compliant with type, enumeration, and regular expression constraints.
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits (percentages, sample counts, or specific file references) for the Geoquery, JSON Schema, or JSON Grammar tasks. The data augmentation process described in Appendix E generates test data variations for multiple runs rather than defining standard evaluation splits.
Hardware Specification | Yes | All experiments were conducted on a system equipped with an NVIDIA GeForce RTX 3090 (24GB VRAM) and an AMD EPYC 7452 32-core processor.
Software Dependencies | Yes | The software environment consisted of PyTorch 2.4.0 and CUDA 12.4, with model inference performed using Transformers v4.48.0. Four pre-trained large language models were employed: google/gemma-2-9b-it (Gemma Team & Shreya Pathak, 2024), meta-llama/Llama-3-8B-Instruct (Dubey et al., 2024), mistralai/Mistral-7B-Instruct-v0.3 (Jiang et al., 2023), and qwen/Qwen2.5-7B-Instruct (Yang et al., 2024), all using half-precision (FP16) inference. For more details on the Python libraries, see Appendix A.
Experiment Setup | Yes | Four pre-trained large language models were employed in this study: google/gemma-2-9b-it (Gemma Team & Shreya Pathak, 2024), meta-llama/Llama-3-8B-Instruct (Dubey et al., 2024), mistralai/Mistral-7B-Instruct-v0.3 (Jiang et al., 2023), and qwen/Qwen2.5-7B-Instruct (Yang et al., 2024), all utilizing half-precision (FP16) inference.
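The Pseudocode row notes that the paper describes the Earley algorithm's three operations (Prediction, Scanning, Completion) only conceptually. For readers unfamiliar with them, the following is a minimal textbook Earley recognizer in Python. This is an illustrative sketch, not Formatron's implementation: the toy grammar and all function names here are invented for this example.

```python
# Illustrative textbook Earley recognizer (NOT the paper's Formatron code).
# A chart item (head, body, dot, origin) means: production head -> body,
# recognized up to position `dot`, starting at input position `origin`.

GRAMMAR = {"S": [["S", "+", "S"], ["a"]]}  # toy grammar: S -> S '+' S | 'a'

def earley_recognize(tokens, start="S"):
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:  # iterate to a fixed point at position i
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body) and body[dot] in GRAMMAR:
                    # Prediction: expand the nonterminal after the dot.
                    for prod in GRAMMAR[body[dot]]:
                        item = (body[dot], tuple(prod), 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            changed = True
                elif dot < len(body):
                    # Scanning: consume the next input token if it matches.
                    if i < len(tokens) and tokens[i] == body[dot]:
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:
                    # Completion: advance items that were waiting on `head`.
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(earley_recognize(list("a+a")))  # True
print(earley_recognize(list("a+")))   # False
```

The chart-per-position structure is what makes Earley amenable to incremental, token-by-token use in constrained decoding: after each accepted token, the current chart determines which terminals can legally come next.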
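The Open Datasets row mentions JSON instances constrained by type, enumeration, and regular-expression (pattern) keywords. The sketch below shows what such a schema looks like and checks instances against that small subset with a hand-rolled validator; the schema contents and the `conforms` helper are made up for illustration and are not taken from the paper or its benchmark.

```python
import re

# A made-up JSON Schema exercising the three constraint kinds named in the
# paper's task description: type, enum, and pattern (regular expression).
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": r"^[A-Z][a-z]+$"},
        "state": {"type": "string", "enum": ["texas", "ohio", "utah"]},
        "population": {"type": "integer"},
    },
    "required": ["name", "state"],
}

def conforms(instance, schema):
    """Minimal hand-rolled check for the schema subset used above."""
    type_map = {"string": str, "integer": int}
    for key in schema.get("required", []):
        if key not in instance:
            return False
    for key, sub in schema["properties"].items():
        if key not in instance:
            continue
        value = instance[key]
        if not isinstance(value, type_map[sub["type"]]):
            return False
        if "enum" in sub and value not in sub["enum"]:
            return False
        if "pattern" in sub and not re.fullmatch(sub["pattern"], value):
            return False
    return True

print(conforms({"name": "Austin", "state": "texas"}, schema))   # True
print(conforms({"name": "austin", "state": "texas"}, schema))   # False (pattern)
print(conforms({"name": "Austin", "state": "nevada"}, schema))  # False (enum)
```

In constrained decoding, such keywords are compiled into a grammar ahead of time, so invalid outputs are prevented during generation rather than rejected by a validator like this one afterwards.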
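The abstract quoted under Research Type describes keeping outputs grammar-compliant during decoding. A common mechanism that grammar engines of this kind plug into is logit masking: before each token is sampled, the logits of tokens the grammar disallows are set to negative infinity. A minimal sketch, where the logit values and allowed-token ids are invented:

```python
# Hedged sketch of the generic constrained-decoding step; the token ids
# and logit values below are arbitrary examples, not real model outputs.

def mask_logits(logits, allowed):
    """Set logits of grammar-invalid token ids to -inf so they can't win."""
    return [x if i in allowed else float("-inf")
            for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Pick the argmax token id (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.0, 0.5, 1.5]          # pretend vocabulary of 4 tokens
print(greedy_pick(mask_logits(logits, {0, 2})))  # 2: best *allowed* token
```

The engineering challenge the paper targets is computing the `allowed` set quickly at every step; the masking itself is a cheap vectorized operation in practice.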