Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Authors: Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin SU, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations indicate that, based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation, especially in prior text-to-SQL benchmarks, they require significant improvement in order to achieve adequate performance for real-world enterprise usage.
Researcher Affiliation | Collaboration | University of Hong Kong; Salesforce Research; Sea AI Lab; Google DeepMind; Google Cloud AI Research; University of Waterloo
Pseudocode | No | The paper describes the 'ANNOTATION PIPELINE' in six numbered steps, but these are descriptive phases of work rather than structured pseudocode or an algorithm block. For example, '1) Database and SQL collection.' is a high-level description, not a code-like procedure.
Open Source Code | Yes | Our code, baseline models, and data are available at spider2-sql.github.io.
Open Datasets | Yes | Our code, baseline models, and data are available at spider2-sql.github.io.
Dataset Splits | No | Spider 2.0-lite is not divided into train and dev sets; we manually select representative examples from the same SQL dialect as the SQL to be predicted, with distinct characteristics (encompassing multiple CTEs or nested queries, or requiring intricate data processing) to serve as few-shot examples.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It mentions using 'LLMs' and 'code agent frameworks' but gives no underlying hardware specifications.
Software Dependencies | No | The paper mentions 'DBT' (data build tool) and using 'Python or Shell' for command-line scripts, but it does not specify version numbers for these or any other software libraries or frameworks. It lists various LLMs evaluated, but not their specific versions as software dependencies for the authors' methodology.
Experiment Setup | Yes | LLMs. ... we use a temperature of 0.0 and truncate from the beginning of the input if it still exceeds the max token limit required by the models. ... Code agent frameworks. ... We use a temperature of 1.0 and a top-p of 0.9, and truncate from the beginning of the input if it still exceeds the max token limit required by the models. ... We heuristically request the agents to complete the tasks within a max step limit of 30...
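The decoding and truncation settings reported above can be sketched as follows. This is a minimal illustration, not code from the paper: `count_tokens` and `truncate_front` are hypothetical helpers, and a real evaluation would use the model's own tokenizer (e.g., tiktoken for OpenAI models) rather than a whitespace split.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy for a real tokenizer (assumption, for illustration).
    return len(text.split())

def truncate_front(text: str, max_tokens: int) -> str:
    """Drop tokens from the *beginning* of the input until it fits the limit,
    keeping the most recent context, as described in the experiment setup."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[-max_tokens:])

# Sampling settings reported in the paper:
LLM_SAMPLING = {"temperature": 0.0}                  # direct LLM prompting
AGENT_SAMPLING = {"temperature": 1.0, "top_p": 0.9}  # code agent frameworks
MAX_AGENT_STEPS = 30                                 # heuristic agent step limit
```

Front-truncation (as opposed to truncating the tail) preserves the most recently appended context, which matters for agent trajectories where the latest observations are usually the most relevant.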