ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields promising performance.
Researcher Affiliation | Academia | Sichuan University; Independent Researcher; Tianfu Jincheng Laboratory
Pseudocode | Yes | A.9 Algorithm of MCP: "In this appendix, to make our MCP clearer, we describe the pipeline in detail in Algorithm 1."
Open Source Code | No | The code and data are available at here. Finally, we have released the code and synthetic data at here for reproducibility, thus further advancing the Text2SQL community.
Open Datasets | Yes | We conduct extensive experiments on five benchmarks to verify the effectiveness of our method. These benchmarks include two widely-used cross-domain benchmarks, i.e., SPIDER (Yu et al., 2018) and BIRD (Li et al., 2024c), and three robustness benchmarks derived from SPIDER, i.e., SPIDER-SYN (Gan et al., 2021a), SPIDER-DK (Gan et al., 2021b), and SPIDER-Realistic (Deng et al., 2020).
Dataset Splits | Yes | SPIDER consists of 7,000 Text-SQL pairs in the training set, 1,034 pairs in the development set, and 2,147 pairs in the test set, covering nearly 200 databases and 138 domains. BIRD is a recently proposed benchmark including 9,428, 1,534, and 1,789 pairs in the training, development, and test sets, respectively.
Hardware Specification | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for the 14B-sized LLM).
Software Dependencies | No | We use the Llama-Factory framework (Zheng et al., 2024) to conduct SFT and MSFT for reproducibility. The paper does not provide specific version numbers for the Llama-Factory framework or for other key software components such as Python or PyTorch.
Experiment Setup | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for the 14B-sized LLM). LLMs are fine-tuned for two epochs using AdamW with a learning rate of 1e-5, decayed to 0 at the end of training by a cosine scheduler. During inference, the temperature is set to 0.01 to ensure reproducibility.
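The learning-rate schedule described above (a base rate of 1e-5 decayed to 0 by a cosine scheduler over two epochs) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the step count below is an assumption derived from the reported SPIDER training-set size (7,000 pairs) and batch size 64.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-5):
    """Cosine schedule decaying from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Illustrative step count: two epochs over ~7,000 pairs at batch size 64.
total_steps = 2 * (7000 // 64)

lr_start = cosine_lr(0, total_steps)            # equals base_lr (1e-5)
lr_end = cosine_lr(total_steps, total_steps)    # decays to 0 at the end
```

At warmup-free settings like this, the rate halves at the midpoint of training (cos(pi/2) = 0), which is the standard behavior of a cosine decay without restarts.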