OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling
Authors: Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, Jing Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that RESOCRATIC-29K significantly improves the performance of open-source models. We conducted an in-depth evaluation of a range of LLMs under various settings. Table 2: Main results on OPTIBENCH. Table 3: Ablation study on synthetic data. |
| Researcher Affiliation | Collaboration | 1 The Hong Kong University of Science and Technology (Guangzhou); 2 The Hong Kong University of Science and Technology; 3 University of California, Merced; 4 ETH Zurich; 5 City University of Hong Kong; 6 Huawei Noah's Ark Lab; 7 Sun Yat-sen University; 8 MBZUAI; 9 Chongqing University |
| Pseudocode | No | The paper includes Python code snippets for solving optimization problems and implementing prompts in the main text and appendix (e.g., Figure 2, Appendix E), but it does not contain a block explicitly labeled "Pseudocode" or "Algorithm" for its methodology. |
| Open Source Code | No | "In the future, we plan to extend ReSocratic to other complex reasoning tasks such as math word problem-solving and evaluate more large language models on our proposed OPTIBENCH benchmark." The paper uses PySCIPOpt as a solver, which is a third-party tool, but it does not provide access information (such as a URL or an explicit statement) for the source code of the authors' own ReSocratic method or benchmark implementation. |
| Open Datasets | No | "In this work, we propose OPTIBENCH, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs. We collect 29k samples with ReSocratic, resulting in the RESOCRATIC-29K dataset." While the paper introduces and synthesizes the OPTIBENCH and RESOCRATIC-29K datasets, it does not provide explicit access information such as a link, DOI, or repository name for public download. |
| Dataset Splits | No | "We evaluate LLMs under three settings: Zero-shot, Few-shot, and Supervised Fine-Tuning (SFT)." "For a given language model, we utilize our contributed RESOCRATIC-29K to conduct supervised fine-tuning." The paper uses OPTIBENCH for evaluation and RESOCRATIC-29K for fine-tuning, implying a training and a test set, but it does not specify percentages, sample counts, or a methodology for these or any validation splits. |
| Hardware Specification | Yes | "Based on this, we conduct fine-tuning experiments on two A800 GPUs, the epoch is set as 3, the learning rate is 2e-5, and the batch size is 128." |
| Software Dependencies | No | "We require our workers to write Python code, call the pyscipopt solver to solve each problem, and ask them to output the values of the variables and optimization targets at the end of the code." The code examples in the appendix (e.g., E.1.1) use `import pyscipopt` and `import math`. However, specific version numbers for Python, PySCIPOpt, or other libraries are not provided. |
| Experiment Setup | Yes | "For a given language model, we utilize our contributed RESOCRATIC-29K to conduct supervised fine-tuning." Fine-tuning runs on two A800 GPUs with 3 epochs, a learning rate of 2e-5, and a batch size of 128. The threshold of the similarity filter is set at 0.7, the sampling temperature is 0.7, and 50 responses are sampled per query. |