IterGen: Iterative Semantic-aware Structured LLM Generation with Backtracking

Authors: Shubham Dipak Ugare, Rohan Gumaste, Tarun Suresh, Gagandeep Singh, Sasa Misailovic

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation presents three distinct scenarios, which demonstrate the effectiveness of ITERGEN. First, we illustrate how it can be used to improve the accuracy of LLM-generated SQL queries by enforcing additional semantic constraints. ITERGEN achieves 18.5% mean improvement over the state-of-the-art grammar-guided generation technique (Ugare et al., 2024). Second, we show how ITERGEN effectively reduces privacy leaks in LLM-generated text from 51.4% to 0%, thus successfully safeguarding sensitive information while maintaining the quality of response. Third, we show that ITERGEN improves the accuracy of LLM-generated Vega-Lite specifications (a subset of JSON for data visualization) by 17.8% by enforcing semantic constraints.
Researcher Affiliation | Academia | Shubham Ugare, Rohan Gumaste, Tarun Suresh, Gagandeep Singh, Sasa Misailovic (University of Illinois Urbana-Champaign)
Pseudocode | Yes | The detailed pseudocode for the forward and backward algorithms is presented in Appendix A.1 (A.1.1 Algorithm 1: START function; A.1.2 Algorithm 2: FORWARD function; A.1.3 Algorithm 3: BACKWARD function).
Open Source Code | Yes | Our code and additional resources are available at http://structuredllm.com. ITERGEN code is available at https://github.com/uiuc-arc/itergen. We provide the source code of ITERGEN as part of the supplementary material that can be used to reproduce our results.
Open Datasets | Yes | We use the standard Spider (Yu et al., 2018) text-to-SQL dataset for the evaluation. This dataset has 1034 problems, categorized into difficulty levels: easy (250), medium (440), hard (174), and extra hard (170). ... We use the DecodingTrust (Wang et al., 2024) privacy dataset... For the evaluation, we use the NLV Corpus (Srinivasan et al., 2021), a dataset comprising 814 examples of text utterances paired with corresponding Vega-Lite visualization specifications.
Dataset Splits | No | We use the standard Spider (Yu et al., 2018) text-to-SQL dataset for the evaluation. This dataset has 1034 problems, categorized into difficulty levels: easy (250), medium (440), hard (174), and extra hard (170). ... For the evaluation, we use the NLV Corpus (Srinivasan et al., 2021), a dataset comprising 814 examples of text utterances paired with corresponding Vega-Lite visualization specifications. The paper lists the datasets, their difficulty categorization, and total problem counts, but does not explicitly describe how the data was split into training, validation, or test sets for these experiments.
Hardware Specification | Yes | Experimental Setup. We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVIDIA RTX A5000 GPUs. ITERGEN is implemented using PyTorch (Paszke et al., 2019), the Hugging Face Transformers library (Wolf et al., 2020), and the SYNCODE library (Ugare et al., 2024) for the parser-guided LLM generation infrastructure.
Software Dependencies | No | ITERGEN is implemented using PyTorch (Paszke et al., 2019), the Hugging Face Transformers library (Wolf et al., 2020), and the SYNCODE library (Ugare et al., 2024) for the parser-guided LLM generation infrastructure. The paper lists the software libraries used (PyTorch, Hugging Face Transformers, SYNCODE) but does not provide specific version numbers for these dependencies, which are crucial for reproducibility.
Experiment Setup | Yes | We use greedy decoding for the experiment, set ITERGEN's maximum limit for moving backward as max_iter=20, and set the ITERGEN recurrence penalty to 0.7, as it worked well on a small subset of the training dataset. We use \n\n as an additional stop word to the EOS token for all experiments and use a max new token limit of 100 for all three methods. ... For ITERGEN we set a recurrence penalty γ of 0.7, and limit the number of per-email backtracking attempts to 10. ... For ITERGEN we set a recurrence penalty γ of 0.1, and set max_iter to 50.
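The forward/backward loop referenced in the Pseudocode row (Appendix A.1 of the paper) can be illustrated with a toy sketch. This is NOT the authors' implementation: the real algorithms operate over a grammar parser state and real LLM logits, whereas here a deterministic `mock_model`, a caller-supplied `is_valid` check, and the `generate` helper are all hypothetical stand-ins, kept only to show the forward-step / backtrack-step / iteration-budget structure described in the review.

```python
# Toy sketch of an IterGen-style forward/backward generation loop.
# Assumptions: mock_model, is_valid, and generate are illustrative names,
# not the paper's API; semantic checking is reduced to a token predicate.

def mock_model(prefix, banned):
    """Deterministic stand-in for an LLM: proposes the first vocabulary
    token that is neither banned nor already present in the prefix."""
    vocab = ["SELECT", "name", "FROM", "users", ";"]
    for tok in vocab:
        if tok not in banned and tok not in prefix:
            return tok
    return ";"  # fallback: terminate

def generate(is_valid, max_iter=20):
    """Forward: extend the sequence one token at a time.
    Backward: on a semantic-constraint violation, truncate to the last
    valid prefix and ban the offending token so it is not re-proposed
    (a crude analogue of discouraging regeneration after backtracking).
    max_iter bounds the total number of forward/backward steps,
    mirroring the max_iter=20 setting quoted above."""
    tokens, banned, iters = [], set(), 0
    while iters < max_iter:
        iters += 1
        tok = mock_model(tokens, banned)
        tokens.append(tok)       # forward step
        if not is_valid(tokens):
            tokens.pop()         # backward step: backtrack
            banned.add(tok)
            continue
        if tok == ";":           # generation complete
            break
    return tokens

# Example semantic constraint: forbid the token "name".
result = generate(lambda toks: "name" not in toks)
```

Here the loop first proposes "name", detects the violation, backtracks, and then completes a sequence that satisfies the constraint.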
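The recurrence penalty γ quoted in the Experiment Setup row (0.7 for SQL, 0.1 for Vega-Lite) can be read as down-weighting tokens that earlier generation attempts backtracked over. A minimal sketch of one plausible realization, assuming a multiplicative penalty on unnormalized probabilities (the function name and the exact penalty form are assumptions, not taken from the paper):

```python
import math

def apply_recurrence_penalty(logits, backtracked_ids, gamma):
    """Subtract log(1/gamma) from the logits of previously backtracked
    tokens. Since softmax weights are proportional to exp(logit), this
    scales each penalized token's unnormalized weight by gamma
    (e.g. gamma = 0.7 leaves 70% of the original weight)."""
    penalty = math.log(1.0 / gamma)
    return [l - penalty if i in backtracked_ids else l
            for i, l in enumerate(logits)]

# Penalize token id 1 with gamma = 0.7; other logits are untouched.
adjusted = apply_recurrence_penalty([1.0, 2.0, 3.0], {1}, 0.7)
```

A smaller γ (such as the 0.1 used for Vega-Lite) imposes a larger log-space penalty, steering the decoder more aggressively away from previously rejected continuations.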