Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search
Authors: Boyan Li, Jiayi Zhang, Ju Fan, Yanwei Xu, Chong Chen, Nan Tang, Yuyu Luo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Alpha-SQL achieves 69.7% execution accuracy on the BIRD development set, using a 32B open-source LLM without fine-tuning. |
| Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology (Guangzhou), ²Renmin University of China, ³Huawei Technologies Ltd. |
| Pseudocode | Yes | The Alpha-SQL algorithm, as outlined in Algorithm 1, operates in multiple phases: Selection, Expansion, Simulation, and Backpropagation. Given a user query q and the corresponding database schema D, the algorithm starts by initializing an empty search tree Ψ = (V, E) with a root node v0 representing the initial state (lines 3-4). |
| Open Source Code | Yes | The code is available at https://github.com/HKUSTDial/Alpha-SQL. |
| Open Datasets | Yes | We utilize the Spider (Yu et al., 2018) and BIRD (Li et al., 2023c) development sets for evaluation. |
| Dataset Splits | Yes | To enable more comparison experiments while reducing computational costs (Sections 5.3 to 5.5), we follow CHESS (Talaei et al., 2024) and utilize the same Subsampled Development Set (SDS), which comprises 10% of each database from the BIRD development set. The SDS contains 147 samples, consisting of 81 simple, 54 moderate, and 12 challenging questions. |
| Hardware Specification | Yes | All experiments are run on an Ubuntu 22.04.3 LTS server with 512GB of RAM and dual 40-core Intel(R) Xeon(R) Platinum 8383C CPUs (@ 2.70GHz). Open-source LLMs are deployed locally using 8 GPUs, each with 80GB of memory and 312 TFLOPS with BF16 precision. |
| Software Dependencies | No | The paper names no software libraries or solvers with version numbers. The operating system (Ubuntu 22.04.3 LTS) is mentioned, but an OS version does not count as an ancillary library or solver dependency. |
| Experiment Setup | Yes | The related hyper-parameters were set as follows: For offline database value retrieval, we set the editing similarity ϵedit as 0.3 and semantic similarity ϵsemantic as 0.6. For the MCTS rollout process, we set the number of rollouts to Nrollout = 24. During node expansion, each action was sampled Nexpansion = 3 times with a sampling temperature of Texpansion = 0.8. In the computation of self-supervised rewards, we set the SQL sampling parameters with Nreward = 5 repetitions and a temperature of Treward = 1.0. For the SQL Revision action (A6), we set a maximum iteration limit of Nrevision = 10 for the multi-round correction process. |
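The Selection/Expansion/Simulation/Backpropagation loop described in the pseudocode row, together with the reported hyper-parameters, can be sketched as a minimal, generic MCTS skeleton. This is an illustrative reconstruction, not the authors' implementation: in Alpha-SQL the `expand_fn` and `reward_fn` stand-ins are LLM calls (action sampling at temperature 0.8 and a self-consistency SQL reward), and the exploration constant `C_UCT` is an assumption not stated in the report.

```python
import math
import random

# Hyper-parameters as reported in the experiment setup row.
N_ROLLOUT = 24      # number of MCTS rollouts
N_EXPANSION = 3     # action samples per node expansion (LLM, T=0.8, in the paper)
N_REWARD = 5        # SQL samples for the self-consistency reward (not simulated here)
C_UCT = 1.4         # exploration constant -- an assumption, not from the paper

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # partial reasoning state (question, schema, actions so far)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def uct(self):
        # Upper Confidence bound applied to Trees; unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + C_UCT * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(root, expand_fn, reward_fn, is_terminal):
    """One Selection -> Expansion -> Simulation -> Backpropagation cycle per rollout."""
    for _ in range(N_ROLLOUT):
        # Selection: descend via UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expansion: sample candidate next actions (stand-in for LLM action sampling).
        if not is_terminal(node.state):
            for _ in range(N_EXPANSION):
                node.children.append(Node(expand_fn(node.state), parent=node))
            node = random.choice(node.children)
        # Simulation: score the state (stand-in for the self-supervised reward).
        reward = reward_fn(node.state)
        # Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda c: c.visits) if root.children else root
```

With toy `expand_fn`/`reward_fn` callables (e.g. appending a token and rewarding longer states), the loop runs exactly `N_ROLLOUT` rollouts and returns the most-visited child of the root, mirroring the search budget reported in the paper.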