Enhancing SQL Query Generation with Neurosymbolic Reasoning

Authors: Henrijs Princis, Cristina David, Alan Mycroft

AAAI 2025

Reproducibility (Variable / Result / LLM Response)
Variable: Research Type
Result: Experimental
LLM Response: "Experiments on Xander, our open-source implementation, show it both reduces runtime and increases accuracy of the generated SQL. A specific result is an LM using Xander outperforming a four-times-larger LM. ... The results of this experiment can be seen in Table 1."
Variable: Researcher Affiliation
Result: Academia
LLM Response: "Henrijs Princis [1], Cristina David [2], Alan Mycroft [1]. [1] University of Cambridge, Cambridge CB3 0FD, UK; [2] University of Bristol, Bristol BS8 1QU, UK. EMAIL, EMAIL, EMAIL"
Variable: Pseudocode
Result: Yes
LLM Response: "Algorithm 1: SQL Query Generation with Xander"
Variable: Open Source Code
Result: Yes
LLM Response: "Code: https://github.com/henrijsprincis/Xander"
Variable: Open Datasets
Result: Yes
LLM Response: "Spider dataset: https://huggingface.co/datasets/xlangai/spider"
Variable: Dataset Splits
Result: Yes
LLM Response: "We used the Spider (Yu et al. 2018) dataset, which is the most challenging benchmark for cross-domain and multi-table text-to-SQL. The training set was used to fine-tune the network and the validation set was used to measure real-world accuracy and runtime. ... After removing those, we have 6779 queries in the training dataset, and 1018 in the validation dataset (compared to 7000 and 1034, respectively, in the original Spider dataset)."
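
The split sizes quoted above can be sanity-checked with a few lines of Python. The counts come directly from the quote; the paper's actual filtering criterion is not specified here, so only the arithmetic is verified:

```python
# Spider split sizes quoted in the report.
ORIGINAL = {"train": 7000, "validation": 1034}   # original Spider
FILTERED = {"train": 6779, "validation": 1018}   # after the paper's filtering

# Number of queries removed from each split.
removed = {split: ORIGINAL[split] - FILTERED[split] for split in ORIGINAL}
print(removed)  # {'train': 221, 'validation': 16}

# Fraction of each split that was dropped (roughly 3.2% / 1.5%).
dropped_pct = {s: 100 * removed[s] / ORIGINAL[s] for s in ORIGINAL}
```

So the filtering removed 221 training and 16 validation queries, a small fraction of the original dataset.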
Variable: Hardware Specification
Result: Yes
LLM Response: "Except those for Microsoft Phi-1.5, experiments were performed using a Tesla P100 GPU and an Intel Xeon 6142 CPU. For Microsoft Phi-1.5, due to the larger model size, we used an Amazon EC2 G5.xlarge instance with an A10G (24 GB) GPU."
Variable: Software Dependencies
Result: No
LLM Response: "Experiments used Python with the Hugging Face transformers library (Wolf et al. 2020)."
Variable: Experiment Setup
Result: Yes
LLM Response: "All networks except Microsoft Phi-1.5 were fitted for 50 epochs with a batch size of 10. The Adam (Kingma and Ba 2017) optimiser with a learning rate of 4e-5 was used to find the optimal weights. For Microsoft Phi-1.5, to save memory, a batch size of 1 was used and the RMSProp optimiser was used instead of Adam. To account for the larger network size, the learning rate was reduced to 4e-6 and the network was fitted for 5 epochs."
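
The two training regimes described above can be encoded as a small configuration table. This is a hedged sketch, not the authors' code: the `TrainConfig` class and `select_config` helper are illustrative names, while the hyperparameter values are taken verbatim from the quote:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str        # "adam" or "rmsprop"
    learning_rate: float
    batch_size: int
    epochs: int

# Default regime: all models except Microsoft Phi-1.5.
DEFAULT = TrainConfig(optimizer="adam", learning_rate=4e-5,
                      batch_size=10, epochs=50)

# Phi-1.5 is larger, so the report uses batch size 1 (to save memory),
# RMSProp instead of Adam, a 10x lower learning rate, and fewer epochs.
PHI_1_5 = TrainConfig(optimizer="rmsprop", learning_rate=4e-6,
                      batch_size=1, epochs=5)

def select_config(model_name: str) -> TrainConfig:
    """Pick the training configuration based on the model name."""
    return PHI_1_5 if "phi-1.5" in model_name.lower() else DEFAULT
```

In a PyTorch fine-tuning loop these settings would map onto `torch.optim.Adam(model.parameters(), lr=4e-5)` for the default regime and `torch.optim.RMSprop(model.parameters(), lr=4e-6)` for Phi-1.5.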