Generating Streamlining Constraints with Large Language Models

Authors: Florentina Voboril, Vaidyanathan Peruvemba Ramaswamy, Stefan Szeider

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated across seven diverse constraint satisfaction problems, our method achieves substantial runtime reductions. ... We evaluate StreamLLM extensively across seven diverse constraint problems, including standard benchmarks and a novel hypergraph coloring problem, using hundreds of instances specified in MiniZinc. The problems are formulated in the MiniZinc constraint programming language (Nethercote et al. 2007). We test StreamLLM with two state-of-the-art LLMs (GPT-4o and Claude 3.5 Sonnet), different prompting variants, and different strategies for combining adaptive prompting with test runs on easy instances. Our results demonstrate remarkable effectiveness, with some streamliners reducing solving time by up to 99%.
Researcher Affiliation | Academia | FLORENTINA VOBORIL, VAIDYANATHAN PERUVEMBA RAMASWAMY, and STEFAN SZEIDER, Technische Universität Wien, Austria
Pseudocode | Yes | Algorithm 1 outlines the general strategy, which can be summarized as follows.
Open Source Code | Yes | The source code and benchmark instances can be found on Zenodo (Voboril, Peruvemba Ramaswamy, and Szeider 2025).
Open Datasets | Yes | All instances, MiniZinc models, and the Python implementation of StreamLLM (which includes the prompts) are available on Zenodo. https://zenodo.org/doi/10.5281/zenodo.13331048
Dataset Splits | Yes | For each problem, we create 15 training instances that take less than 10 seconds. Further, we create at least 50 test instances that take between 10 minutes and 2 hours to solve. ... The offline approach investigates whether extended training time improves streamliner quality. We increase the timeout from 10 minutes to 4 hours and expand the training set from 15 instances solvable within 10 seconds to 35 instances with solving times between 1 and 30 seconds.
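The split criterion above (training: under 10 seconds; test: 10 minutes to 2 hours) can be sketched as a simple partition by measured solve time. This is an illustrative sketch, not the paper's code: `split_instances` and the toy timing dictionary are hypothetical, and in practice the times would come from running the unstreamlined MiniZinc model on each instance.

```python
# Sketch of the train/test split described above. Thresholds are the
# ones reported in the paper; instance names and times are made up.

TRAIN_MAX = 10            # training instances: solvable in < 10 s
TEST_MIN = 10 * 60        # test instances: at least 10 min ...
TEST_MAX = 2 * 60 * 60    # ... and at most 2 h

def split_instances(times):
    """Partition {instance: solve_time_in_seconds} into train/test pools."""
    train = [i for i, t in times.items() if t < TRAIN_MAX]
    test = [i for i, t in times.items() if TEST_MIN <= t <= TEST_MAX]
    return train, test

# toy example: one easy instance, two in the test window, one too hard
times = {"a.dzn": 3.2, "b.dzn": 900.0, "c.dzn": 5000.0, "d.dzn": 9000.0}
train, test = split_instances(times)
```

Instances outside both windows (like `d.dzn` above, beyond 2 hours) are simply discarded from the benchmark.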
Hardware Specification | Yes | We evaluate the running times of the test instances on compute nodes with two 2.4 GHz 10-core Intel Xeon E5-2640 v4 processors with 80 GB of RAM each.
Software Dependencies | Yes | We use MiniZinc 2.8.3 and the Chuffed 0.13.1 solver. As LLMs, we use GPT-4o (gpt-4o-2024-11-20) and Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) and access them via the openai 1.29.0 and anthropic 0.25.8 packages in Python 3.11.5.
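Pinning the reported Python-side versions might look like the following. This is a sketch: it assumes the PyPI package names `openai` and `anthropic`; MiniZinc itself (and the bundled Chuffed solver) is not a pip package and is installed separately.

```shell
# Pin the LLM client libraries to the versions reported in the paper
# (assumes a Python 3.11 environment).
pip install openai==1.29.0 anthropic==0.25.8
# MiniZinc 2.8.3, which bundles the Chuffed solver, is installed
# separately from https://www.minizinc.org/
```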
Experiment Setup | Yes | Our system submits several queries (prompts) to the LLM that result in candidate streamliners and tests the candidate streamliners on a small set of n_train easy training instances that can be solved in less than t_train seconds. Rigorous testing on several problems indicates that a relatively small number of test instances suffices (about 15) and even small and easy test instances (solvable in under 10 seconds with the unstreamlined model) provide a good indication of how well streamliners will work on large and hard instances; we lay out these experiments in more detail in Sections 4.5 and 4.6. ... Algorithm 1 outlines the general strategy... Then, for about ten minutes, the algorithm repeatedly prompts the LLM to generate candidate streamliners and evaluates their performance on the training instances. ... We use five sets of training instances with an upper bound of t_train ∈ {10, 20, 30, 60} seconds, respectively. The experiments show that t_train has no influence on the one streamliner that was picked in the end, hence we settled on the upper bound t_train = 10 for the following experiments. ... setting n_train = 15 seems a fair compromise for the forthcoming experiments.
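The generate-and-test loop summarized above (Algorithm 1: prompt for candidates for a fixed time budget, score each on the easy training instances, keep the best) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `propose_streamliner` and `train_score` are hypothetical stand-ins for the LLM prompt and the training-instance evaluation, stubbed here so the loop is runnable.

```python
import random
import time

def propose_streamliner(rng):
    # In StreamLLM this would prompt GPT-4o or Claude 3.5 Sonnet for a
    # candidate MiniZinc constraint; stubbed as a random "constraint".
    return f"constraint x[1] <= {rng.randint(1, 100)};"

def train_score(streamliner):
    # In StreamLLM: how well the streamlined model performs on the
    # ~15 easy training instances. Stubbed as a deterministic score.
    return len(streamliner) % 7

def search(budget_seconds=0.01, seed=0):
    """Generate candidates until the time budget runs out; keep the best."""
    rng = random.Random(seed)
    deadline = time.monotonic() + budget_seconds  # ~10 minutes in the paper
    best, best_score = None, float("-inf")
    # always evaluate at least one candidate, then run until the deadline
    while best is None or time.monotonic() < deadline:
        cand = propose_streamliner(rng)
        score = train_score(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

best, score = search()
```

In the paper the budget is about ten minutes and the winning streamliner is then applied to the hard test instances; here the budget is shrunk so the sketch runs instantly.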