Structure-Guided Large Language Models for Text-to-SQL Generation
Authors: Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art baselines, including 11 finetuning models, 7 structure learning models, and 14 in-context learning models. |
| Researcher Affiliation | Academia | ¹Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China; ²City University of Macau, Macau SAR, China; ³College of Information Science and Technology, Jinan University, GZ, China. |
| Pseudocode | No | The paper describes the methodology using natural language, definitions, and mathematical formulations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions several open-source LLMs and baseline models but does not explicitly state that the code for the proposed SGU-SQL methodology is open-source, nor does it provide any specific repository links. |
| Open Datasets | Yes | We assess the performance of text-to-SQL models using two renowned datasets, Spider (Yu et al., 2019) and BIRD (Li et al., 2023c). |
| Dataset Splits | Yes | Spider, a cross-domain text-to-SQL dataset, comprises 8659 instances in the training split and 1034 instances in the development split, spanning 200 databases. Each instance pairs a natural language question about a specific database with its corresponding SQL query. For evaluation, we use the Spider-dev development split, since the test split has not been released. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using various Large Language Models (LLMs) and other text-to-SQL methods as baselines, but it does not specify any software dependencies with version numbers for its own implementation (e.g., Python, PyTorch, specific libraries and their versions). |
| Experiment Setup | No | The paper discusses various prompting strategies and compares them, and mentions fine-tuning methods like LoRA and QLoRA for backbone LLMs. However, it does not explicitly provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed system-level training settings for its experiments in the main text. |
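The split counts quoted above (8659 training and 1034 development instances across 200 databases) can be verified directly from Spider's released JSON files. Below is a minimal sketch of such a tally; the two inline records are a hypothetical sample in Spider's record schema (real files contain thousands of entries, and the field names `db_id`, `question`, and `query` follow the public Spider release):

```python
import json
from collections import Counter

# Hypothetical two-record sample mimicking Spider's JSON schema:
# each record pairs a natural-language question with its SQL query
# and the database (db_id) it targets.
sample = json.loads("""
[
  {"db_id": "concert_singer", "question": "How many singers do we have?",
   "query": "SELECT count(*) FROM singer"},
  {"db_id": "pets_1", "question": "How many pets are owned?",
   "query": "SELECT count(*) FROM pets"}
]
""")

def split_stats(records):
    """Return (instance count, number of distinct databases) for one split."""
    dbs = Counter(r["db_id"] for r in records)
    return len(records), len(dbs)

n_instances, n_dbs = split_stats(sample)
print(n_instances, n_dbs)  # → 2 2
```

Running `split_stats` on the actual `train_spider.json` and `dev.json` files would confirm whether the reported 8659/1034 split sizes hold for a given Spider release.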