Guiding Large Language Models in Modeling Optimization Problems via Question Partitioning
Authors: Xiaotian Pan, Junhao Fang, Feng Wu, Sijia Zhang, Yi-Xiang Hu, Shaoang Li, Xiang-Yang Li
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments demonstrate that our method improves performance on the common benchmark dataset NLP4LP, achieving an accuracy of 62.3% and a code executability rate of 86.8% when tested on GPT-4, both outperforming existing methods. Additionally, we demonstrate the effectiveness of our PaMOP in handling large real-world problems. Ablation studies further confirm the importance of using the partition tree in enhancing model performance. |
| Researcher Affiliation | Academia | School of Computer Science and Technology, University of Science and Technology of China |
| Pseudocode | No | The paper describes methods and processes in natural language and mathematical formulations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | The NLP4LP dataset [Ahmadi Teshnizi et al., 2024] is collected from optimization textbooks and manuals. It includes problems such as network flow, scheduling, combinatorial optimization, and more. In total, it contains 54 LP problems and 13 MILP problems. |
| Dataset Splits | No | The paper mentions using the NLP4LP dataset and a custom set of real-world problems but does not specify exact training, validation, or test splits for these datasets. It notes, "Each example contains a description of the problem, the classification of the problem, the dimensions of the input data, and the data file," and "To adapt the dataset to the AMPL format, we have preprocessed the dataset's data.json into a data.dat version," but no explicit split information is provided. |
| Hardware Specification | No | The paper mentions testing the system using GPT-4 but does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running their experiments or computations. |
| Software Dependencies | No | The paper mentions using AMPL and Gurobi but does not specify their version numbers. For example, "We use AMPL [Fourer et al., 1987] for modeling, as it separates the model and data files. Unlike humans, LLMs treat mathematical formulas, modeling languages, and programming languages as different languages, so we directly generate code in the modeling language instead of formulas." and "we use AMPL to call Gurobi to solve the model" |
| Experiment Setup | Yes | For these experiments, we set the model's temperature to 0.2 (controls randomness of the model's output) and the maximum number of failed iterations to 5. |