Confidence Estimation for Error Detection in Text-to-SQL Systems

Authors: Oleg Somov, Elena Tutubalina

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that the encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3; thus the designated external entropy-based selective classifier performs better with it. The study also reveals that, in terms of error detection, the selective classifier detects errors associated with irrelevant questions with higher probability than incorrect query generations. We evaluated our five models on four distinct datasets, using the Fβ score to compare methods.
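The Fβ comparison mentioned above reduces to a short formula. A minimal sketch of the standard Fβ definition follows; the function name and counting convention are illustrative, not taken from the paper's code:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R),
    where P = tp / (tp + fp) and R = tp / (tp + fn).
    beta > 1 weights recall higher; beta < 1 weights precision higher."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For error detection, tp would count erroneous queries correctly flagged, fp correct queries flagged as errors, and fn errors that slipped through.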
Researcher Affiliation | Collaboration | 1 AIRI, Moscow, Russia; 2 MIPT, Dolgoprudny, Russia; 3 Sber AI, Moscow, Russia; 4 ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia. EMAIL, EMAIL
Pseudocode | No | The paper describes mathematical formulations and heuristics for uncertainty estimation (Equations 1 and 2), selective prediction, and calibration methods (Equation 4 and an optimization problem for Isotonic Regression), but it does not present these as structured pseudocode or algorithm blocks with distinct steps.
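Although the paper gives no pseudocode, the components named above (entropy-based uncertainty, selective prediction with a reject option, and isotonic-regression calibration) can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's implementation; all function names are hypothetical:

```python
import math

def sequence_entropy(step_probs):
    """Mean token-level entropy over a generated SQL sequence.
    step_probs: one probability distribution (list summing to 1) per decoding step."""
    total = 0.0
    for dist in step_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(step_probs)

def selective_predict(query, uncertainty, threshold):
    """Reject option: abstain (return None) when uncertainty exceeds the threshold."""
    return query if uncertainty <= threshold else None

def pava(y):
    """Pool-adjacent-violators algorithm: the core of isotonic regression,
    fitting the best monotone non-decreasing sequence to the scores y."""
    blocks = []  # each block holds [sum, count]; its fitted value is sum / count
    for v in y:
        blocks.append([float(v), 1])
        # merge while the previous block's mean exceeds the newest block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted
```

For calibration, `pava` would be fit on held-out (uncertainty, correctness) pairs sorted by uncertainty; `sklearn.isotonic.IsotonicRegression` is a production-grade equivalent.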
Open Source Code | Yes | Code: https://github.com/runnerup96/error-detection-intext2sql
Open Datasets | Yes | To study compositional and domain generalization in Text-to-SQL, several benchmarks and datasets have been developed over the years to better approximate real-world scenarios and address various aspects of model performance: complex queries involving join statements across multiple tables (Yu et al. 2018), new and unseen database schemas (Gan, Chen, and Purver 2021; Lee, Polozov, and Richardson 2021), compositional train and test splits (Shaw et al. 2021; Finegan-Dollak et al. 2018), robustness test sets (Bakshandaeva et al. 2022; Chang et al. 2023), dirty schema values and external knowledge requirements (Li et al. 2024; Wretblad et al. 2024), and domain-specific datasets that feature unanswerable questions (Lee et al. 2022). We apply T5 (Raffel et al. 2020), GPT 4 (Achiam et al. 2023), and Llama 3 (Meta 2024) with a reject option over the popular SPIDER (Yu et al. 2018) and EHRSQL (Lee et al. 2022), covering the general and clinical domains. To evaluate such distribution shifts, we leverage two Text-to-SQL datasets: the SPIDER-based PAUQ (Bakshandaeva et al. 2022) and EHRSQL (Lee et al. 2022).
Dataset Splits | Yes | PAUQ in cross-database setting: This setting uses the original SPIDER dataset split, where the data is divided between training and testing sets with no overlap in database structures. ... PAUQ with template shift in single-database setting: ... This split forces the model to demonstrate its systematicity, i.e. the ability to recombine known SQL syntax elements from Dtr to form novel SQL structures in Dtst. ... PAUQ with target-length shift in single-database setting: ... Shorter samples are placed in Dtr and longer samples in Dtst, ensuring that all test tokens appear at least once in Dtr. ... EHRSQL with unanswerable questions: This setting uses the original EHRSQL split.
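The target-length-shift split quoted above (shorter SQL targets in Dtr, longer ones in Dtst, with every test token appearing at least once in Dtr) can be sketched as follows. The function name and the fallback of moving uncovered examples back into train are assumptions, not the authors' exact procedure:

```python
def length_shift_split(pairs, train_frac=0.8):
    """pairs: list of (question, sql) tuples. Sort by SQL token length so
    shorter targets land in train and longer ones in test, then move any
    test example containing a token unseen in train back into train."""
    ranked = sorted(pairs, key=lambda p: len(p[1].split()))
    cut = int(len(ranked) * train_frac)
    train, candidates = ranked[:cut], ranked[cut:]
    train_vocab = {tok for _, sql in train for tok in sql.split()}
    test = []
    for question, sql in candidates:
        if set(sql.split()) <= train_vocab:
            test.append((question, sql))
        else:
            # uncovered tokens would make the example unsolvable at test time
            train.append((question, sql))
            train_vocab |= set(sql.split())
    return train, test
```

By construction, every SQL token in the returned test set occurs in the train set, mirroring the constraint the review describes.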
Hardware Specification | Yes | All experiments were conducted on four A100 80GB GPUs.
Software Dependencies | No | The Ethics Statement mentions "Our PyTorch/Hugging Face code will be released with the paper", indicating the use of PyTorch and Hugging Face. However, no specific version numbers for these or other software dependencies are provided.
Experiment Setup | Yes | The hyperparameters of the fine-tuning are specified in Appendix A.