LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch
Authors: Caigao Jiang, Xiang Shu, Hong Qian, Xingyu Lu, Jun Zhou, Aimin Zhou, Yang Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the optimization generalization ability of LLMOPT and comparison methods across six real-world datasets covering roughly 20 fields such as health, environment, energy, and manufacturing. Extensive experiment results show that LLMOPT is able to model various optimization problem types such as linear/nonlinear programming, mixed integer programming, and combinatorial optimization, and achieves a notable 11.08% average solving accuracy improvement compared with the state-of-the-art methods. |
| Researcher Affiliation | Collaboration | East China Normal University, China; Ant Group, China; Nanjing University, China |
| Pseudocode | No | The paper describes the LLMOPT framework, its components (data, learning, auto-testing), and processes like multi-instruction SFT and model alignment in detail. However, it does not present these processes or any other methodology in the form of a structured pseudocode block or algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/caigaojiang/LLMOPT. |
| Open Datasets | Yes | Firstly, we collect almost all existing optimization problem datasets, including NL4Opt (Ramamonjison et al., 2021), Mamo (Easy LP and Complex LP) (Huang et al., 2024), Industry OR (Tang et al., 2024), NLP4LP (Ahmadi Teshnizi et al., 2024) and Complex OR (Xiao et al., 2024) whose detailed information is introduced in the Appendix A. |
| Dataset Splits | Yes | Subsequently, 100 samples are randomly selected from each dataset as the reserved test dataset. For datasets with fewer than 100 samples, all data are used for testing. The remaining samples are used for data augmentation, ensuring a clear separation between training and testing data. In the data augmentation process, LLMs effectively generate data through prompt engineering (Tang et al., 2024; Luo et al., 2023). To build a high-quality dataset, seven distinct instructions are applied to 1,763 seed problems. ... Due to the limited data in NLP4LP and Complex OR, all data from these datasets are used for testing and excluded from the training process. |
| Hardware Specification | Yes | We utilize NVIDIA 8*A100 Tensor Core GPUs with 80 GB each for model training and employ 1*A100 GPU for model inference. |
| Software Dependencies | No | We implement all model training using the PyTorch framework and utilize Qwen 1.5 with 14 billion parameters (Bai et al., 2023) as the base model. |
| Experiment Setup | Yes | The hyperparameters of the training are shown in Table 6. (Un-)Desirable Weight in the table represents the hyperparameters λU and λD of KTO in the paper. |
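The train/test split procedure quoted under "Dataset Splits" (100 randomly selected test samples per dataset, with datasets under 100 samples used entirely for testing) can be sketched as follows. This is a minimal illustration of the described rule, not code from the LLMOPT repository; the function name, seed, and dictionary layout are assumptions.

```python
import random

def make_splits(datasets, test_size=100, seed=0):
    """Reserve `test_size` random samples per dataset for testing.

    Datasets with fewer than `test_size` samples are used entirely for
    testing (as with NLP4LP and ComplexOR in the paper); the remaining
    samples go to the augmentation/training pool.
    """
    rng = random.Random(seed)
    splits = {}
    for name, samples in datasets.items():
        if len(samples) <= test_size:
            # Too few samples: all reserved for testing, none for training.
            splits[name] = {"test": list(samples), "train": []}
        else:
            shuffled = list(samples)
            rng.shuffle(shuffled)
            splits[name] = {
                "test": shuffled[:test_size],
                "train": shuffled[test_size:],
            }
    return splits

# Hypothetical dataset sizes, for illustration only.
splits = make_splits({"NL4Opt": list(range(300)), "ComplexOR": list(range(40))})
```

This keeps training and testing data strictly separated per dataset, matching the paper's statement that the remaining samples feed data augmentation while the reserved test sets stay untouched.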