Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Authors: Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin SU, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations indicate that, based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation, especially in prior text-to-SQL benchmarks, they require significant improvement in order to achieve adequate performance for real-world enterprise usage.
Researcher Affiliation | Collaboration | University of Hong Kong; Salesforce Research; Sea AI Lab; Google DeepMind; Google Cloud AI Research; University of Waterloo
Pseudocode | No | The paper describes the 'ANNOTATION PIPELINE' in six numbered steps, but these are descriptive phases of work rather than structured pseudocode or an algorithm block. For example, '1) Database and SQL collection.' is a high-level description, not a code-like procedure.
Open Source Code | Yes | Our code, baseline models, and data are available at spider2-sql.github.io.
Open Datasets | Yes | Our code, baseline models, and data are available at spider2-sql.github.io.
Dataset Splits | No | Spider 2.0-lite is not divided into train and dev sets; we manually select representative examples from the same SQL dialect as the SQL to be predicted, with distinct characteristics (encompassing multiple CTEs or nested queries, or requiring intricate data processing) to serve as few-shot examples.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It mentions using 'LLMs' and 'code agent frameworks' but gives no underlying hardware specifications.
Software Dependencies | No | The paper mentions 'DBT' (data build tool) and using 'Python or Shell' for command-line scripts, but it does not specify version numbers for these or any other software libraries or frameworks. It lists various LLMs evaluated, but not their specific versions as software dependencies for the authors' methodology.
Experiment Setup | Yes | LLMs. ... we use a temperature of 0.0 and truncate from the beginning of the input if it still exceeds the max token limit required by the models. ... Code agent frameworks. ... We use a temperature of 1.0 and a top-p of 0.9, and truncate from the beginning of the input if it still exceeds the max token limit required by the models. ... We heuristically request the agents to complete the tasks within a max step limit of 30...
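The decoding and truncation settings reported above can be sketched as follows. This is a minimal illustration, not code from the paper: `count_tokens` and `truncate_front` are hypothetical helpers, and a real evaluation would use the model's own tokenizer (e.g., tiktoken for OpenAI models) rather than a whitespace split.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy for a real tokenizer (assumption, for illustration).
    return len(text.split())

def truncate_front(text: str, max_tokens: int) -> str:
    """Drop tokens from the *beginning* of the input until it fits the limit,
    keeping the most recent context, as described in the experiment setup."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[-max_tokens:])

# Sampling settings reported in the paper:
LLM_SAMPLING = {"temperature": 0.0}                  # direct LLM prompting
AGENT_SAMPLING = {"temperature": 1.0, "top_p": 0.9}  # code agent frameworks
MAX_AGENT_STEPS = 30                                 # heuristic agent step limit
```

Front-truncation (as opposed to truncating the tail) preserves the most recently appended context, which matters for agent trajectories where the latest observations are usually the most relevant.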