SQL-PaLM: Improved large language model adaptation for Text-to-SQL
Authors: Ruoxi Sun, Sercan O Arik, Alexandre Muzio, Lesly Miculicich, Satya Kesav Gundabathula, Pengcheng Yin, Hanjun Dai, Hootan Nakhost, Rajarishi Sinha, Zifeng Wang, Tomas Pfister
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights into Text-to-SQL's future work. ... We systematically explore large models' potential for Text-to-SQL and study the research topics along the key aspects presented in Sec. 4. Through extensive experiments and analyses, we unravel multiple key factors that influence the LLMs' performance when adapting to Text-to-SQL. |
| Researcher Affiliation | Industry | Ruoxi Sun1 EMAIL Sercan Ö. Arik1 EMAIL Alex Muzio2 EMAIL Lesly Miculicich1 EMAIL Satya Gundabathula2 EMAIL Pengcheng Yin3 EMAIL Hanjun Dai3 EMAIL Hootan Nakhost1 EMAIL Rajarishi Sinha1 EMAIL Zifeng Wang EMAIL Tomas Pfister1 tpfister@google.com 1 Cloud AI Research; 2 Google Cloud; 3 Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Test-time refinement via execution-based selection. Input: database D; number of questions N; candidate queries {SQL_i^j} from P training paradigms; SQL executor E. Output: outputs = []. For each question i = 1 to N: initialize executions = [] and indexes = []; then for each paradigm j = 1 to P: compute e = E(SQL_i^j, D); if e is valid, append e to executions and j to indexes. Finally, outputs ← {SQL_i^j \| E(SQL_i^j, D) = arg max_e counts(executions), j ∈ indexes}, i.e., among the SQLs without execution error, select the SQL whose execution output occurs the maximum number of times. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | We consider publicly-available large-scale Text-to-SQL benchmarks. Spider (Yu et al., 2018) contains 7000 training samples across 166 databases and 1034 evaluation samples ( Dev split ) across 20 databases from a variety of domains. Spider-SYN (Gan et al., 2021a) is a complex variant of the Spider dev split... Spider-realistic (Deng et al., 2020) samples 508 text-SQL pairs from Spider dev split... Spider-DK (Gan et al., 2021b) samples 535 question-SQL pairs... BIRD (Li et al., 2023d) is a comprehensive dataset containing 9428 question-SQL pairs for train split and 1534 pairs for dev split, across 95 databases totalling a size of 33.4 GB. |
| Dataset Splits | Yes | Spider (Yu et al., 2018) contains 7000 training samples across 166 databases and 1034 evaluation samples ( Dev split ) across 20 databases from a variety of domains. ... BIRD (Li et al., 2023d) is a comprehensive dataset containing 9428 question-SQL pairs for train split and 1534 pairs for dev split, across 95 databases totalling a size of 33.4 GB. |
| Hardware Specification | No | The paper mentions using "PaLM-2 Unicorn variant" and "PaLM-2 Gecko" models, and LoRA finetuning parameters, but does not specify any particular hardware like GPU models, CPU types, or cloud computing instances used for running the experiments. |
| Software Dependencies | No | The paper mentions several LLM models and techniques like PaLM-2, UL2 (Tay et al., 2022), and LoRA (Hu et al., 2021), but it does not specify software dependencies such as programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For few-shot prompting, we use Spider datasets. For each question, we sample PaLM-2 32 times with temperature of 0.5. ... For instruction-tuning, ... We train until convergence, and the number of steps is no more than 10K steps. LoRA finetuning: Following Hu et al. (2021), we incorporate trainable linear low-rank modules into the query and value projections of each self-attention layer. We set the rank of LoRA to 32, learning rate to 1e-4, and the model architecture to Gecko PaLM model. |
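The execution-based selection in Algorithm 1 amounts to a majority vote over execution outputs across candidate queries. A minimal sketch, assuming SQLite as the executor E (the paper does not name its execution backend, and `select_by_execution` is an illustrative helper, not code from the paper):

```python
import sqlite3
from collections import Counter

def select_by_execution(candidate_sqls, db_path):
    """Among candidates that execute without error, return the SQL whose
    execution output occurs most often (majority vote over outputs)."""
    conn = sqlite3.connect(db_path)
    valid = []  # (sql, hashable execution output) for candidates that ran
    for sql in candidate_sqls:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # skip candidates with execution errors
        valid.append((sql, tuple(rows)))
    conn.close()
    if not valid:
        return None  # every candidate failed to execute
    counts = Counter(output for _, output in valid)
    best_output, _ = counts.most_common(1)[0]
    for sql, output in valid:
        if output == best_output:
            return sql  # first candidate matching the majority output
```

For example, given candidates `["SELECT 1", "SELECT 1", "SELECT 2"]`, the two queries returning `1` agree, so `SELECT 1` is selected.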
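The LoRA setup quoted above (rank-32 trainable low-rank modules on the query and value projections, frozen base weights) can be illustrated with a toy forward pass. This is a conceptual sketch only: the hidden size and `lora_forward` helper are illustrative assumptions, not PaLM-2's actual dimensions or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 32  # hidden size (illustrative) and the paper's LoRA rank of 32

W = rng.normal(size=(d, d))          # frozen pretrained projection (e.g. query)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # LoRA (Hu et al., 2021): output = W x + B A x; only A and B are trained,
    # adding 2*d*r parameters per adapted projection instead of d*d.
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d,))
# With B zero-initialized, the adapter starts as an identity perturbation:
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing B is what makes finetuning start exactly from the pretrained model's behavior, which matches the recipe in Hu et al. (2021) that the excerpt cites.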