Online Prompt Selection for Program Synthesis

Authors: Yixuan Li, Lewis Frampton, Federico Mora, Elizabeth Polgreen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implement an instance of our approach, called CYANEA, and evaluate it on synthesis queries from the literature in ranking function synthesis, from the syntax-guided synthesis competition, and fresh, unseen queries generated from SMT problems. CYANEA solves 37.2% more queries than the best single solver and achieves results within 4% of the virtual best solver."
Researcher Affiliation | Academia | Yixuan Li (University of Edinburgh, UK), Lewis Frampton (University of Edinburgh, UK), Federico Mora (University of California, Berkeley, USA), Elizabeth Polgreen (University of Edinburgh, UK)
Pseudocode | No | The paper uses flowcharts (Figures 2 and 3) to illustrate the system architecture and describes components in prose, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states, "We implement an instance of our approach, called CYANEA...", but provides no explicit statement about releasing the source code and no link to a code repository.
Open Datasets | Yes | "We evaluate our approach on synthesis queries from the Syntax-Guided Synthesis competition (Alur et al. 2024), ranking function synthesis (Giacobbe, Kroening, and Parsert 2022), and queries automatically generated from the SMT competition (Parsert and Polgreen 2024)."
Dataset Splits | No | The paper states, "The total number of queries is 1269. We report the average scores of 20 runs, with the queries randomly shuffled in each run." This describes a re-sampling strategy for evaluation but does not specify explicit train/validation/test splits with percentages or counts.
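The shuffled-runs protocol quoted above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function and parameter names (`evaluate`, `run_portfolio`) are assumptions, and the scoring here is a simple solved-query count.

```python
import random
import statistics

def evaluate(queries, run_portfolio, runs=20, seed=0):
    """Average a solved-query count over repeated runs, shuffling the
    query order each run (illustrative sketch of the quoted protocol)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = queries[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)          # fresh random order per run
        solved = sum(1 for q in shuffled if run_portfolio(q))
        scores.append(solved)
    return statistics.mean(scores)
```

Shuffling matters because an online selector's predictions depend on which queries it has already seen; averaging over 20 orders removes that order dependence from the reported score.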
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models or memory specifications) used for running the experiments.
Software Dependencies | No | The paper names "cvc5 (Barbosa et al. 2022) as the SMT solver" and uses GPT (gpt-3.5-turbo-0125) and Llama (Meta-Llama-3-70B). The LLM identifiers are specific, but cvc5 is cited without a version number, and no versioned software library dependencies are listed.
Experiment Setup | Yes | "We set a total timeout of T = 100 seconds and a total cost budget of C = 100,000. For all k-NN predictors, we set k = 15."
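For readers unfamiliar with k-NN predictors in this setting, a minimal sketch of k-nearest-neighbour solver selection is shown below. This is not CYANEA's actual interface: `history`, `knn_predict`, and the feature representation are illustrative assumptions; only the default k = 15 comes from the paper.

```python
import math
from collections import Counter

def knn_predict(history, features, k=15):
    """Pick the solver that won most often on the k past queries whose
    feature vectors are closest to `features` (Euclidean distance).
    `history` is a list of (feature_vector, winning_solver) pairs;
    all names here are illustrative, not CYANEA's actual API."""
    nearest = sorted(history, key=lambda h: math.dist(h[0], features))[:k]
    votes = Counter(solver for _, solver in nearest)
    return votes.most_common(1)[0][0]   # majority vote among neighbours
```

Usage: with a history of past queries clustered around two feature values, a new query near either cluster is routed to that cluster's winning solver.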