MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL
Authors: Arian Askari, Christian Poelitz, Xinye Tang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that MAGIC's guideline outperforms expert humans' created ones. We empirically find out that the guideline produced by MAGIC enhances the interpretability of the corrections made, providing insights in analyzing the reason behind the failures and successes of LLMs in self-correction. |
| Researcher Affiliation | Collaboration | 1Leiden University, 2Microsoft Research Cambridge, UK, 3Microsoft Redmond |
| Pseudocode | No | The paper includes block diagrams (Figure 2) and structured prompt templates (Figures 3, 4, 5, 6) but not formal pseudocode or algorithm blocks describing the method's steps. |
| Open Source Code | Yes | We publish all code to reproduce our experiments as open source: https://github.com/microsoft/SynQo |
| Open Datasets | Yes | Datasets https://huggingface.co/datasets/microsoft/MAGIC The Spider (Yu et al. 2018) dataset... The BIRD dataset (Li et al. 2023)... |
| Dataset Splits | Yes | The Spider (Yu et al. 2018) dataset is a collection of 10,181 questions and 5,693 unique complex SQL queries across 200 databases in 138 domains, with each domain featuring multiple tables. It is divided into training, development, and test sets with 8,659, 1,034, and 2,147 examples, respectively, across 146, 20, and 34 distinct databases, ensuring no overlap between sets. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions models and frameworks like 'GPT-4' and 'DIN-SQL', but does not provide specific ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA, or specific library versions). |
| Experiment Setup | Yes | We set 5 as the maximum number of iterations. We determined that a feedback batch size of 10 is optimal. For self-consistency (Wang et al. 2022; Gao et al. 2023), we generate 20 SQL queries... For the Multiple-Prompt baseline, we follow the approach in (Lee et al. 2024) by reordering candidate tables in the prompt and generating up to 20 different combinations... |
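For context on the self-consistency step quoted above (sampling 20 SQL candidates and keeping the most common one), a minimal sketch of majority voting over candidates is shown below. This is a simplified string-level illustration, not the paper's code: the candidate queries and the `normalize` helper are hypothetical, and the actual method votes over execution results rather than raw query text.

```python
from collections import Counter

def self_consistency_vote(candidates: list[str]) -> str:
    """Return the most frequent candidate after light normalization.

    Normalization (collapse whitespace, lowercase, drop trailing ';')
    is a stand-in for the stronger equivalence check of comparing
    query execution results.
    """
    def normalize(sql: str) -> str:
        return " ".join(sql.split()).lower().rstrip(";")

    counts = Counter(normalize(c) for c in candidates)
    winner, _ = counts.most_common(1)[0]
    # Return the first original candidate whose normalized form won.
    return next(c for c in candidates if normalize(c) == winner)

# Hypothetical LLM samples (the paper uses 20 per question).
samples = [
    "SELECT name FROM students WHERE gpa > 3.5;",
    "select name from students where gpa > 3.5",
    "SELECT s.name FROM students s WHERE s.gpa > 3.5;",
    "SELECT name FROM students WHERE gpa > 3.5;",
]
print(self_consistency_vote(samples))
# The two surface variants of the first query collapse to one form,
# so that form wins the vote.
```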