ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL
Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields promising performance. |
| Researcher Affiliation | Academia | 1Sichuan University, 2Independent Researcher, 3Tianfu Jincheng Laboratory |
| Pseudocode | Yes | A.9 ALGORITHM OF MCP In this appendix, to make our MCP clearer, we describe the pipeline in detail in Algorithm 1. Algorithm 1 The algorithm of MCP |
| Open Source Code | No | The code and data are available at here. Finally, we have released the code and synthetic data at here for reproducibility, thus further advancing the Text2SQL community. |
| Open Datasets | Yes | To evaluate our method, we conduct extensive experiments on five benchmarks. These benchmarks include two widely-used cross-domain benchmarks, i.e., SPIDER (Yu et al., 2018) and BIRD (Li et al., 2024c), and three robustness benchmarks derived from SPIDER, i.e., SPIDER-SYN (Gan et al., 2021a), SPIDER-DK (Gan et al., 2021b), and SPIDER-Realistic (Deng et al., 2020). |
| Dataset Splits | Yes | SPIDER consists of 7,000 Text-to-SQL pairs in the training set, 1,034 pairs in the development set, and 2,147 pairs in the test set, covering nearly 200 databases and 138 domains. BIRD is a recently proposed benchmark including 9,428, 1,534, and 1,789 pairs in the training, development, and test sets, respectively. |
| Hardware Specification | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for 14B-sized LLM). |
| Software Dependencies | No | We use the Llama-Factory framework (Zheng et al., 2024) to conduct SFT and MSFT for reproducibility. The paper does not provide specific version numbers for the Llama-Factory framework or other key software components like Python or PyTorch. |
| Experiment Setup | Yes | We conduct experiments on 8 A100 GPUs with a batch size of 64 (32 for the 14B-sized LLM). LLMs are fine-tuned for two epochs using AdamW with a learning rate of 1e-5, decayed to 0 at the end of training by a cosine scheduler. During inference, the temperature is set to 0.01 to ensure reproducibility. |
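The reported schedule (base learning rate 1e-5, cosine-decayed to 0 over training) can be sketched as below. This is a minimal illustration assuming standard cosine annealing; the function name and step counts are illustrative, not taken from the paper's code.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-5) -> float:
    """Cosine-annealed learning rate: starts at base_lr, reaches 0 at
    the final step. Mirrors the setup described above (illustrative
    sketch, not the authors' implementation)."""
    progress = step / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# The rate starts at the base value and decays smoothly to zero.
print(cosine_lr(0, 1000))     # 1e-05 at the first step
print(cosine_lr(1000, 1000))  # ~0 at the last step
```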