SQL-PaLM: Improved large language model adaptation for Text-to-SQL
Authors: Ruoxi Sun, Sercan O Arik, Alexandre Muzio, Lesly Miculicich, Satya Kesav Gundabathula, Pengcheng Yin, Hanjun Dai, Hootan Nakhost, Rajarishi Sinha, Zifeng Wang, Tomas Pfister
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights into Text-to-SQL's future work. ... We systematically explore large models' potential for Text-to-SQL and study the research topics along the key aspects presented in Sec. 4. Through extensive experiments and analyses, we unravel multiple key factors that influence the LLMs' performance when adapting to Text-to-SQL. |
| Researcher Affiliation | Industry | Ruoxi Sun1 EMAIL Sercan Ö. Arik1 EMAIL Alex Muzio2 EMAIL Lesly Miculicich1 EMAIL Satya Gundabathula2 EMAIL Pengcheng Yin3 EMAIL Hanjun Dai3 EMAIL Hootan Nakhost1 EMAIL Rajarishi Sinha1 EMAIL Zifeng Wang EMAIL Tomas Pfister1 tpfister@google.com 1 Cloud AI Research; 2 Google Cloud; 3 Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Test-time refinement via execution-based selection. Input: database D; number of questions N; candidate queries {SQL_i^j} from P training paradigms; SQL executor E. Output: outputs = []. For each question i = 1 to N: initialize executions = [] and indexes = []; then for each paradigm j = 1 to P: compute e = E(SQL_i^j, D); if e is valid, append e to executions and j to indexes. Finally, outputs ← {SQL_i^j \| E(SQL_i^j, D) = arg max_e counts(executions), j ∈ indexes}, i.e., among the SQLs without execution error, select the SQL whose execution output occurs the maximum number of times. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | We consider publicly-available large-scale Text-to-SQL benchmarks. Spider (Yu et al., 2018) contains 7000 training samples across 166 databases and 1034 evaluation samples ( Dev split ) across 20 databases from a variety of domains. Spider-SYN (Gan et al., 2021a) is a complex variant of the Spider dev split... Spider-realistic (Deng et al., 2020) samples 508 text-SQL pairs from Spider dev split... Spider-DK (Gan et al., 2021b) samples 535 question-SQL pairs... BIRD (Li et al., 2023d) is a comprehensive dataset containing 9428 question-SQL pairs for train split and 1534 pairs for dev split, across 95 databases totalling a size of 33.4 GB. |
| Dataset Splits | Yes | Spider (Yu et al., 2018) contains 7000 training samples across 166 databases and 1034 evaluation samples ( Dev split ) across 20 databases from a variety of domains. ... BIRD (Li et al., 2023d) is a comprehensive dataset containing 9428 question-SQL pairs for train split and 1534 pairs for dev split, across 95 databases totalling a size of 33.4 GB. |
| Hardware Specification | No | The paper mentions using "PaLM-2 Unicorn variant" and "PaLM-2 Gecko" models, and LoRA finetuning parameters, but does not specify any particular hardware like GPU models, CPU types, or cloud computing instances used for running the experiments. |
| Software Dependencies | No | The paper mentions several LLM models and techniques like PaLM-2, UL2 (Tay et al., 2022), and LoRA (Hu et al., 2021), but it does not specify software dependencies such as programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For few-shot prompting, we use Spider datasets. For each question, we sample PaLM-2 32 times with temperature of 0.5. ... For instruction-tuning, ... We train until convergence, and the number of steps is no more than 10K steps. LoRA finetuning: Following Hu et al. (2021), we incorporate trainable linear low-rank modules into the query and value projections of each self-attention layer. We set the rank of LoRA to 32, learning rate to 1e-4, and the model architecture to Gecko PaLM model. |
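The execution-based selection in Algorithm 1 amounts to a majority vote over execution outputs across candidate queries. A minimal sketch, assuming SQLite as the executor E (the paper does not name its execution backend, and `select_by_execution` is an illustrative helper, not code from the paper):

```python
import sqlite3
from collections import Counter

def select_by_execution(candidate_sqls, db_path):
    """Among candidates that execute without error, return the SQL whose
    execution output occurs most often (majority vote over outputs)."""
    conn = sqlite3.connect(db_path)
    valid = []  # (sql, hashable execution output) for candidates that ran
    for sql in candidate_sqls:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # skip candidates with execution errors
        valid.append((sql, tuple(rows)))
    conn.close()
    if not valid:
        return None  # every candidate failed to execute
    counts = Counter(output for _, output in valid)
    best_output, _ = counts.most_common(1)[0]
    for sql, output in valid:
        if output == best_output:
            return sql  # first candidate matching the majority output
```

For example, given candidates `["SELECT 1", "SELECT 1", "SELECT 2"]`, the two queries returning `1` agree, so `SELECT 1` is selected.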
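The LoRA setup quoted above (rank-32 trainable low-rank modules on the query and value projections, frozen base weights) can be illustrated with a toy forward pass. This is a conceptual sketch only: the hidden size and `lora_forward` helper are illustrative assumptions, not PaLM-2's actual dimensions or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 32  # hidden size (illustrative) and the paper's LoRA rank of 32

W = rng.normal(size=(d, d))          # frozen pretrained projection (e.g. query)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # LoRA (Hu et al., 2021): output = W x + B A x; only A and B are trained,
    # adding 2*d*r parameters per adapted projection instead of d*d.
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d,))
# With B zero-initialized, the adapter starts as an identity perturbation:
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing B is what makes finetuning start exactly from the pretrained model's behavior, which matches the recipe in Hu et al. (2021) that the excerpt cites.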