ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields promising performance.
Researcher Affiliation | Academia | Sichuan University; Independent Researcher; Tianfu Jincheng Laboratory
Pseudocode | Yes | A.9 Algorithm of MCP: "In this appendix, to make our MCP clearer, we describe the pipeline in detail in Algorithm 1."
Open Source Code | No | The code and data are available at here. Finally, we have released the code and synthetic data at here for reproducibility, thus further advancing the Text2SQL community.
Open Datasets | Yes | We conduct extensive experiments on five benchmarks to verify the effectiveness of our method. These benchmarks include two widely-used cross-domain benchmarks, i.e., SPIDER (Yu et al., 2018) and BIRD (Li et al., 2024c), and three robustness benchmarks derived from SPIDER, i.e., SPIDER-SYN (Gan et al., 2021a), SPIDER-DK (Gan et al., 2021b), and SPIDER-Realistic (Deng et al., 2020).
Dataset Splits | Yes | SPIDER consists of 7,000 Text-SQL pairs in the training set, 1,034 pairs in the development set, and 2,147 pairs in the test set, covering nearly 200 databases and 138 domains. BIRD is a recently proposed benchmark including 9,428, 1,534, and 1,789 pairs in the training, development, and test sets, respectively.
Hardware Specification | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for the 14B-sized LLM).
Software Dependencies | No | We use the Llama-Factory framework (Zheng et al., 2024) to conduct SFT and MSFT for reproducibility. The paper does not provide specific version numbers for the Llama-Factory framework or for other key software components such as Python or PyTorch.
Experiment Setup | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for the 14B-sized LLM). LLMs are fine-tuned for two epochs using AdamW with a learning rate of 1e-5, decayed to 0 at the end of training by a cosine scheduler. During inference, the temperature is set to 0.01 to ensure reproducibility.
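The learning-rate schedule described above (a base rate of 1e-5 decayed to 0 by a cosine scheduler over two epochs) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the step count below is an assumption derived from the reported SPIDER training-set size (7,000 pairs) and batch size 64.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-5):
    """Cosine schedule decaying from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Illustrative step count: two epochs over ~7,000 pairs at batch size 64.
total_steps = 2 * (7000 // 64)

lr_start = cosine_lr(0, total_steps)            # equals base_lr (1e-5)
lr_end = cosine_lr(total_steps, total_steps)    # decays to 0 at the end
```

At warmup-free settings like this, the rate halves at the midpoint of training (cos(pi/2) = 0), which is the standard behavior of a cosine decay without restarts.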