CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Authors: Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, Sercan Arik

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We present comprehensive evaluations of the efficacy of the proposed methodologies of CHASE-SQL. Our innovative candidate generation approaches demonstrate superior performance compared to traditional generic CoT prompts, illustrating their capability in guiding LLMs through the decomposition of complex problems into manageable intermediate steps. Furthermore, the proposed selection agent significantly outperforms conventional consistency-based methods, contributing to the state-of-the-art results. Specifically, CHASE-SQL reaches an execution accuracy of 73.01% and 73.0% on the development and test sets of the challenging BIRD Text-to-SQL benchmark, outperforming all published and undisclosed methods on this benchmark by a large margin.
Researcher Affiliation Collaboration 1Google Cloud, Sunnyvale, CA, USA 2Stanford University, Stanford, CA, USA
Pseudocode | Yes | Algorithm 1: Divide-and-Conquer Chain-of-Thought (CoT) strategy for Text-to-SQL. Algorithm 2: Online synthetic-example generation strategy for Text-to-SQL. Algorithm 3: Picking the final SQL query from a pool of candidates. Algorithm 4: Query fixing method.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for the described methodology.
Open Datasets | Yes | We evaluate the performance of the proposed CHASE-SQL framework on two widely recognized cross-domain datasets: BIRD (Li et al., 2024c) and Spider (Yu et al., 2018).
Dataset Splits | Yes | The Spider dataset is divided into non-overlapping training, development, and test sets, similar to BIRD.
Hardware Specification | No | The paper mentions using Gemini and Claude models and training a Gemini 1.5 Flash model with the Vertex AI tuning API, but does not provide specific hardware details such as GPU/CPU models or memory capacity.
Software Dependencies | Yes | Moreover, by leveraging entirely open-source models, the Mistral Large model (AI, 2024) as the candidate generator and a fine-tuned Qwen-2.5-Coder model (Team, 2024) as the selector, our method achieved state-of-the-art performance of 70.33 on the BIRD development set with open-source models.
Experiment Setup | Yes | The Gemini 1.5 Flash model is trained for 10 epochs using a LoRA adapter with a rank of 16 via the Vertex AI tuning API.