Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
Authors: Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations indicate that, based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation, especially on prior text-to-SQL benchmarks, they require significant improvement in order to achieve adequate performance for real-world enterprise usage. |
| Researcher Affiliation | Collaboration | University of Hong Kong; Salesforce Research; Sea AI Lab; Google DeepMind; Google Cloud AI Research; University of Waterloo |
| Pseudocode | No | The paper describes the 'ANNOTATION PIPELINE' in six numbered steps, but these are descriptive phases of work rather than structured pseudocode or an algorithm block. For example, '1) Database and SQL collection.' is a high-level description, not a code-like procedure. |
| Open Source Code | Yes | Our code, baseline models, and data are available at spider2-sql.github.io. |
| Open Datasets | Yes | Our code, baseline models, and data are available at spider2-sql.github.io. |
| Dataset Splits | No | Spider 2.0-lite is not divided into train and dev sets; instead, we manually select representative examples from the same SQL dialect as the SQL to be predicted, with distinct characteristics (encompassing multiple CTEs or nested queries, or requiring intricate data processing), to serve as few-shot examples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It mentions using 'LLMs' and 'code agent frameworks' but no underlying hardware specifications. |
| Software Dependencies | No | The paper mentions 'DBT' (data build tool) and using 'Python or Shell' for command-line scripts, but it does not specify version numbers for these or any other software libraries or frameworks. It lists various LLMs evaluated, but not their specific versions as software dependencies for the authors' methodology. |
| Experiment Setup | Yes | LLMs. ... we use a temperature of 0.0 and truncate from the beginning of the input if still exceeding the max tokens limit required by the models. ... Code agent frameworks. ...We use a temperature of 1.0 and top-p of 0.9 and truncate from the beginning of the input if still exceeding the max tokens limit required by the models. ... We heuristically request the agents to complete the tasks within a max step limit of 30... |
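The truncation strategy quoted in the Experiment Setup row (truncating from the beginning of the input when it exceeds the model's max-token limit) can be sketched as follows. This is a minimal illustration, not code from the paper; `truncate_from_beginning` is a hypothetical helper name, and a real implementation would operate on tokenizer output rather than a plain list.

```python
def truncate_from_beginning(tokens: list, max_tokens: int) -> list:
    """Drop the oldest tokens first, keeping only the most recent
    max_tokens tokens so the prompt fits the model's context limit."""
    if len(tokens) <= max_tokens:
        return tokens
    return tokens[-max_tokens:]

# A 10-token input truncated to a 4-token limit keeps only the last 4 tokens,
# discarding the earliest context as described in the quoted setup.
history = list(range(10))
truncated = truncate_from_beginning(history, 4)
```

Truncating from the beginning preserves the most recent agent steps and observations, which is the usual choice for long agent trajectories where the latest context matters most.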