Autoformulation of Mathematical Optimization Models Using LLMs

Authors: Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, Mihaela van der Schaar

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical analysis on linear and mixed-integer programming benchmarks demonstrates our method's effectiveness, with significant performance gains from both LLM-based value estimation and symbolic pruning techniques.
Researcher Affiliation Academia 1DAMTP, University of Cambridge, Cambridge, UK 2ECE, University of Hawaii at Manoa, Honolulu, USA. Correspondence to: Nicolás Astorga, Tennison Liu <EMAIL>.
Pseudocode No The paper only describes the MCTS algorithm steps in narrative text, without a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code Yes We provide the code to reproduce our results at https://github.com/jumpynitro/AutoFormulator.
Open Datasets Yes We evaluate our methods on four real-world benchmarks: NLP4OPT (Ramamonjison et al., 2023), a curated set of 244 linear programming problems (based on (Tang et al., 2024)); IndustryOR (Tang et al., 2024), consisting of 100 problems spanning linear, integer, and mixed-integer programming at various difficulty levels; ComplexOR (Xiao et al., 2023), with 37 real-world operations research problems from diverse domains; and MAMO (Huang et al., 2024b), using the more advanced ComplexLP subset, which includes 211 problems.
Dataset Splits No The paper reports accuracy on existing benchmarks (NLP4OPT, IndustryOR, ComplexOR, MAMO) but does not explicitly describe any training/validation/test splits, either for its own experiments or for the benchmarks themselves.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions several solvers (Gurobi, CVXPY, SMT solvers, TRCA, SLSQP, COBYLA, COBYQA, CLARABEL, ECOS, SCS, OSQP) but does not provide specific version numbers for these software dependencies as required for reproducibility.
Experiment Setup Yes We configure our method with H = 10 candidate formulations, I = 3 children retained after pruning and scoring, and T = 16 total rollouts.
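
The reported hyperparameters can be captured in a small configuration object. This is a minimal illustrative sketch, not taken from the released code; the class and field names are assumptions, while the values (H = 10, I = 3, T = 16) come directly from the paper's stated setup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the search hyperparameters reported in the paper."""
    num_candidates: int = 10  # H: candidate formulations generated per expansion
    num_children: int = 3     # I: children retained after pruning and scoring
    num_rollouts: int = 16    # T: total rollouts of the tree search

config = SearchConfig()
print(config.num_candidates, config.num_children, config.num_rollouts)
```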