Confidence Estimation for Error Detection in Text-to-SQL Systems

Authors: Oleg Somov, Elena Tutubalina

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that the encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3; thus the designated external entropy-based selective classifier performs better with it. The study also reveals that, in terms of error detection, the selective classifier detects errors associated with irrelevant questions with higher probability than incorrect query generations. We evaluated our five models on four distinct datasets, using the Fβ score to compare methods.
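The Fβ comparison mentioned above reduces to a short formula. A minimal sketch of the standard Fβ definition follows; the function name and counting convention are illustrative, not taken from the paper's code:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R),
    where P = tp / (tp + fp) and R = tp / (tp + fn).
    beta > 1 weights recall higher; beta < 1 weights precision higher."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For error detection, tp would count erroneous queries correctly flagged, fp correct queries flagged as errors, and fn errors that slipped through.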
Researcher Affiliation | Collaboration | 1 AIRI, Moscow, Russia; 2 MIPT, Dolgoprudny, Russia; 3 Sber AI, Moscow, Russia; 4 ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia. EMAIL, EMAIL
Pseudocode | No | The paper describes mathematical formulations and heuristics for uncertainty estimation (Equations 1 and 2), selective prediction, and calibration methods (Equation 4 and an optimization problem for Isotonic Regression), but it does not present these as structured pseudocode or algorithm blocks with distinct steps.
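Although the paper gives no pseudocode, the components named above (entropy-based uncertainty, selective prediction with a reject option, and isotonic-regression calibration) can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's implementation; all function names are hypothetical:

```python
import math

def sequence_entropy(step_probs):
    """Mean token-level entropy over a generated SQL sequence.
    step_probs: one probability distribution (list summing to 1) per decoding step."""
    total = 0.0
    for dist in step_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(step_probs)

def selective_predict(query, uncertainty, threshold):
    """Reject option: abstain (return None) when uncertainty exceeds the threshold."""
    return query if uncertainty <= threshold else None

def pava(y):
    """Pool-adjacent-violators algorithm: the core of isotonic regression,
    fitting the best monotone non-decreasing sequence to the scores y."""
    blocks = []  # each block holds [sum, count]; its fitted value is sum / count
    for v in y:
        blocks.append([float(v), 1])
        # merge while the previous block's mean exceeds the newest block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted
```

For calibration, `pava` would be fit on held-out (uncertainty, correctness) pairs sorted by uncertainty; `sklearn.isotonic.IsotonicRegression` is a production-grade equivalent.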
Open Source Code | Yes | Code: https://github.com/runnerup96/error-detection-intext2sql
Open Datasets | Yes | To study compositional and domain generalization in Text-to-SQL, several benchmarks and datasets have been developed over the years to better approximate real-world scenarios and address various aspects of model performance: complex queries involving join statements across multiple tables (Yu et al. 2018), new and unseen database schemas (Gan, Chen, and Purver 2021; Lee, Polozov, and Richardson 2021), compositional train and test splits (Shaw et al. 2021; Finegan-Dollak et al. 2018), robustness test sets (Bakshandaeva et al. 2022; Chang et al. 2023), dirty schema values and external knowledge requirements (Li et al. 2024; Wretblad et al. 2024), and domain-specific datasets that feature unanswerable questions (Lee et al. 2022). We apply T5 (Raffel et al. 2020), GPT 4 (Achiam et al. 2023), and Llama 3 (Meta 2024) with a reject option over the popular SPIDER (Yu et al. 2018) and EHRSQL (Lee et al. 2022), covering the general and clinical domains. To evaluate such distribution shifts, we leverage two Text-to-SQL datasets: the SPIDER-based PAUQ (Bakshandaeva et al. 2022) and EHRSQL (Lee et al. 2022).
Dataset Splits | Yes | PAUQ in cross-database setting: This setting uses the original SPIDER dataset split, where the data is divided between training and testing sets with no overlap in database structures. ... PAUQ with template shift in single-database setting: ... This split forces the model to demonstrate its systematicity, i.e. the ability to recombine known SQL syntax elements from Dtr to form novel SQL structures in Dtst. ... PAUQ with target-length shift in single-database setting: ... Shorter samples are placed in Dtr and longer samples in Dtst, ensuring that all test tokens appear at least once in Dtr. ... EHRSQL with unanswerable questions: This setting uses the original EHRSQL split.
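The target-length-shift split quoted above (shorter SQL targets in Dtr, longer ones in Dtst, with every test token appearing at least once in Dtr) can be sketched as follows. The function name and the fallback of moving uncovered examples back into train are assumptions, not the authors' exact procedure:

```python
def length_shift_split(pairs, train_frac=0.8):
    """pairs: list of (question, sql) tuples. Sort by SQL token length so
    shorter targets land in train and longer ones in test, then move any
    test example containing a token unseen in train back into train."""
    ranked = sorted(pairs, key=lambda p: len(p[1].split()))
    cut = int(len(ranked) * train_frac)
    train, candidates = ranked[:cut], ranked[cut:]
    train_vocab = {tok for _, sql in train for tok in sql.split()}
    test = []
    for question, sql in candidates:
        if set(sql.split()) <= train_vocab:
            test.append((question, sql))
        else:
            # uncovered tokens would make the example unsolvable at test time
            train.append((question, sql))
            train_vocab |= set(sql.split())
    return train, test
```

By construction, every SQL token in the returned test set occurs in the train set, mirroring the constraint the review describes.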
Hardware Specification | Yes | All experiments were conducted on four A100 80GB GPUs.
Software Dependencies | No | The Ethics Statement mentions "Our PyTorch/Hugging Face code will be released with the paper", indicating the use of PyTorch and Hugging Face. However, no specific version numbers for these or other software dependencies are provided.
Experiment Setup | Yes | The hyperparameters of the fine-tuning are specified in Appendix A.