Confidence Estimation for Error Detection in Text-to-SQL Systems
Authors: Oleg Somov, Elena Tutubalina
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3, and thus the designated external entropy-based selective classifier achieves better performance. The study also reveals that, in terms of error detection, the selective classifier detects errors associated with irrelevant questions with higher probability than errors from incorrect query generation. We evaluated our five models on four distinct datasets, using the Fβ score to compare methods. |
| Researcher Affiliation | Collaboration | 1AIRI, Moscow, Russia; 2MIPT, Dolgoprudny, Russia; 3Sber AI, Moscow, Russia; 4ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia. EMAIL, EMAIL |
| Pseudocode | No | The paper describes mathematical formulations and heuristics for uncertainty estimation (Equations 1 and 2), selective prediction, and calibration methods (Equation 4 and an optimization problem for Isotonic Regression), but it does not present these as structured pseudocode or algorithm blocks with distinct steps. |
| Open Source Code | Yes | Code https://github.com/runnerup96/error-detection-intext2sql |
| Open Datasets | Yes | To study compositional and domain generalization in Text-to-SQL, several benchmarks and datasets have been developed over the years to better approximate real-world scenarios and address various aspects of model performance: complex queries involving join statements across multiple tables (Yu et al. 2018), new and unseen database schemas (Gan, Chen, and Purver 2021; Lee, Polozov, and Richardson 2021), compositional train and test splits (Shaw et al. 2021; Finegan-Dollak et al. 2018), robustness test sets (Bakshandaeva et al. 2022; Chang et al. 2023), dirty schema values and external knowledge requirements (Li et al. 2024; Wretblad et al. 2024), and domain-specific datasets that feature unanswerable questions (Lee et al. 2022). We apply T5 (Raffel et al. 2020), GPT 4 (Achiam et al. 2023), and Llama 3 (Meta 2024) with a reject option over the popular SPIDER (Yu et al. 2018) and EHRSQL (Lee et al. 2022), covering the general and clinical domains. To evaluate such distribution shifts, we leverage two Text-to-SQL datasets: SPIDER-based PAUQ (Bakshandaeva et al. 2022) and EHRSQL (Lee et al. 2022). |
| Dataset Splits | Yes | PAUQ in cross-database setting This setting uses the original SPIDER dataset split, where the data is divided between training and testing sets with no overlap in database structures. ... PAUQ with template shift in single database setting ... This split forces the model to demonstrate its systematicity: the ability to recombine known SQL syntax elements from Dtr to form novel SQL structures in Dtst. ... PAUQ with target length shift in single database setting ... Shorter samples are placed in Dtr, and longer samples in Dtst, ensuring that all test tokens appear at least once in Dtr. ... EHRSQL with unanswerable questions This setting uses the original EHRSQL split. |
| Hardware Specification | Yes | All experiments were conducted on four A100 80GB GPUs. |
| Software Dependencies | No | The Ethics Statement mentions "Our PyTorch/Hugging Face code will be released with the paper", indicating the use of PyTorch and Hugging Face. However, no specific version numbers for these or other software dependencies are provided. |
| Experiment Setup | Yes | The hyperparameters of the fine-tuning are specified in Appendix A. |
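The table notes that the paper gives its uncertainty-estimation and selective-prediction heuristics only as equations, not pseudocode. As a rough illustration of the ideas described (an entropy-based reject option over generated SQL, compared via the Fβ score), the following is a minimal sketch. All function names, the mean-entropy aggregation, and the threshold-based decision rule are illustrative assumptions, not the authors' exact formulation.

```python
import math

def sequence_entropy(token_dists):
    """Mean token-level Shannon entropy over a generated SQL sequence.

    token_dists: list of per-token probability distributions
    (each a list of probabilities summing to 1).
    """
    total = 0.0
    for dist in token_dists:
        total -= sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_dists)

def selective_predict(query, entropy, threshold):
    """Reject (abstain from answering) when uncertainty is too high."""
    return query if entropy <= threshold else None

def f_beta(precision, recall, beta):
    """F_beta score: weighs recall beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A confident prediction (low entropy) passes through unchanged, while an uncertain one is rejected; sweeping the threshold trades coverage against error-detection quality, which the Fβ score then summarizes.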