Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only consistently maintains high-precision compliant outputs but also achieves significant improvements in inference speed, up to 2x compared to state-of-the-art implementations.
Researcher Affiliation | Academia | 1Department of Computer Science, Rice University, Texas, United States; 2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. Correspondence to: Shiwen Ni <EMAIL>.
Pseudocode | No | The paper describes its algorithms conceptually, such as the Earley algorithm and its operations (Prediction, Scanning, Completion), but does not provide a clearly labeled pseudocode block or algorithm figure with structured steps.
Open Source Code | Yes | We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
Open Datasets | Yes | Test Task: Geoquery (Davis & Meltzer, 2007) transformation converts natural language queries into FunQL, adhering to fixed predicates and finite entity constraints; JSON Schema (Pezoa et al., 2016) generation produces JSON instances compliant with type, enumeration, and regular expression constraints.
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits (percentages, sample counts, or specific file references) for the Geoquery, JSON Schema, or JSON Grammar tasks. The data augmentation process described in Appendix E generates test data variations for multiple runs rather than defining standard evaluation splits.
Hardware Specification | Yes | All experiments were conducted on a system equipped with an NVIDIA GeForce RTX 3090 (24GB VRAM) and an AMD EPYC 7452 32-core processor.
Software Dependencies | Yes | The software environment consisted of PyTorch 2.4.0 and CUDA 12.4, with model inference performed using Transformers v4.48.0. Four pre-trained large language models were employed: google/gemma-2-9b-it (Gemma Team & Shreya Pathak, 2024), meta-llama/Llama-3-8B-Instruct (Dubey et al., 2024), mistralai/Mistral-7B-Instruct-v0.3 (Jiang et al., 2023), and qwen/Qwen2.5-7B-Instruct (Yang et al., 2024), all using half-precision (FP16) inference. For more details on the Python libraries, see Appendix A.
Experiment Setup | Yes | Four pre-trained large language models were employed in this study: google/gemma-2-9b-it (Gemma Team & Shreya Pathak, 2024), meta-llama/Llama-3-8B-Instruct (Dubey et al., 2024), mistralai/Mistral-7B-Instruct-v0.3 (Jiang et al., 2023), and qwen/Qwen2.5-7B-Instruct (Yang et al., 2024), all utilizing half-precision (FP16) inference.
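The Pseudocode row notes that the paper describes the Earley algorithm's three operations (Prediction, Scanning, Completion) only conceptually. For readers unfamiliar with them, the following is a minimal textbook Earley recognizer in Python. This is an illustrative sketch, not Formatron's implementation: the toy grammar and all function names here are invented for this example.

```python
# Illustrative textbook Earley recognizer (NOT the paper's Formatron code).
# A chart item (head, body, dot, origin) means: production head -> body,
# recognized up to position `dot`, starting at input position `origin`.

GRAMMAR = {"S": [["S", "+", "S"], ["a"]]}  # toy grammar: S -> S '+' S | 'a'

def earley_recognize(tokens, start="S"):
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:  # iterate to a fixed point at position i
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body) and body[dot] in GRAMMAR:
                    # Prediction: expand the nonterminal after the dot.
                    for prod in GRAMMAR[body[dot]]:
                        item = (body[dot], tuple(prod), 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            changed = True
                elif dot < len(body):
                    # Scanning: consume the next input token if it matches.
                    if i < len(tokens) and tokens[i] == body[dot]:
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:
                    # Completion: advance items that were waiting on `head`.
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(earley_recognize(list("a+a")))  # True
print(earley_recognize(list("a+")))   # False
```

The chart-per-position structure is what makes Earley amenable to incremental, token-by-token use in constrained decoding: after each accepted token, the current chart determines which terminals can legally come next.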
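The Open Datasets row mentions JSON instances constrained by type, enumeration, and regular-expression (pattern) keywords. The sketch below shows what such a schema looks like and checks instances against that small subset with a hand-rolled validator; the schema contents and the `conforms` helper are made up for illustration and are not taken from the paper or its benchmark.

```python
import re

# A made-up JSON Schema exercising the three constraint kinds named in the
# paper's task description: type, enum, and pattern (regular expression).
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": r"^[A-Z][a-z]+$"},
        "state": {"type": "string", "enum": ["texas", "ohio", "utah"]},
        "population": {"type": "integer"},
    },
    "required": ["name", "state"],
}

def conforms(instance, schema):
    """Minimal hand-rolled check for the schema subset used above."""
    type_map = {"string": str, "integer": int}
    for key in schema.get("required", []):
        if key not in instance:
            return False
    for key, sub in schema["properties"].items():
        if key not in instance:
            continue
        value = instance[key]
        if not isinstance(value, type_map[sub["type"]]):
            return False
        if "enum" in sub and value not in sub["enum"]:
            return False
        if "pattern" in sub and not re.fullmatch(sub["pattern"], value):
            return False
    return True

print(conforms({"name": "Austin", "state": "texas"}, schema))   # True
print(conforms({"name": "austin", "state": "texas"}, schema))   # False (pattern)
print(conforms({"name": "Austin", "state": "nevada"}, schema))  # False (enum)
```

In constrained decoding, such keywords are compiled into a grammar ahead of time, so invalid outputs are prevented during generation rather than rejected by a validator like this one afterwards.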
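The abstract quoted under Research Type describes keeping outputs grammar-compliant during decoding. A common mechanism that grammar engines of this kind plug into is logit masking: before each token is sampled, the logits of tokens the grammar disallows are set to negative infinity. A minimal sketch, where the logit values and allowed-token ids are invented:

```python
# Hedged sketch of the generic constrained-decoding step; the token ids
# and logit values below are arbitrary examples, not real model outputs.

def mask_logits(logits, allowed):
    """Set logits of grammar-invalid token ids to -inf so they can't win."""
    return [x if i in allowed else float("-inf")
            for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Pick the argmax token id (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.0, 0.5, 1.5]          # pretend vocabulary of 4 tokens
print(greedy_pick(mask_logits(logits, {0, 2})))  # 2: best *allowed* token
```

The engineering challenge the paper targets is computing the `allowed` set quickly at every step; the masking itself is a cheap vectorized operation in practice.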