OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling
Authors: Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, Jing Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that RESOCRATIC-29K significantly improves the performance of open-source models. We conducted an in-depth evaluation of a range of LLMs under various settings. Table 2: Main results on OPTIBENCH. Table 3: Ablation study on synthetic data. |
| Researcher Affiliation | Collaboration | 1 The Hong Kong University of Science and Technology (Guangzhou); 2 The Hong Kong University of Science and Technology; 3 University of California, Merced; 4 ETH Zurich; 5 City University of Hong Kong; 6 Huawei Noah's Ark Lab; 7 Sun Yat-sen University; 8 MBZUAI; 9 Chongqing University |
| Pseudocode | No | The paper includes Python code snippets for solving optimization problems and implementing prompts in the main text and appendix (e.g., Figure 2, Appendix E), but it does not contain a block explicitly labeled "Pseudocode" or "Algorithm" for its methodology. |
| Open Source Code | No | "In the future, we plan to extend ReSocratic to other complex reasoning tasks such as math word problem-solving and evaluate more large language models on our proposed OPTIBENCH benchmark." The paper uses PySCIPOpt as a solver, which is a third-party tool, but it does not provide access information (such as a URL or an explicit statement) for the source code of the authors' own ReSocratic method or benchmark implementation. |
| Open Datasets | No | "In this work, we propose OPTIBENCH, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs. We collect 29k samples with ReSocratic, resulting in the RESOCRATIC-29K dataset." While the paper introduces and synthesizes the OPTIBENCH and RESOCRATIC-29K datasets, it does not provide explicit access information such as a link, DOI, or repository name for public download. |
| Dataset Splits | No | "We evaluate LLMs under three settings: Zero-shot, Few-shot, and Supervised Fine-Tuning (SFT)." "For a given language model, we utilize our contributed RESOCRATIC-29K to conduct supervised fine-tuning." The paper uses OPTIBENCH for evaluation and RESOCRATIC-29K for fine-tuning, implying a training and a test set, but it does not specify percentages, sample counts, or a methodology for these or any validation splits. |
| Hardware Specification | Yes | "Based on this, we conduct fine-tuning experiments on two A800 GPUs, the epoch is set as 3, the learning rate is 2e-5, and the batch size is 128." |
| Software Dependencies | No | "We require our workers to write Python code, call the pyscipopt solver to solve each problem, and ask them to output the values of the variables and optimization targets at the end of the code." The code examples in the appendix (e.g., E.1.1) use `import pyscipopt` and `import math`. However, specific version numbers for Python, PySCIPOpt, or other libraries are not provided. |
| Experiment Setup | Yes | "For a given language model, we utilize our contributed RESOCRATIC-29K to conduct supervised fine-tuning." Fine-tuning runs on two A800 GPUs with 3 epochs, a learning rate of 2e-5, and a batch size of 128. The threshold of the similarity filter is set at 0.7, the sampling temperature is 0.7, and 50 responses are sampled per query. |