Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems
Authors: Huigen Ye, Hua Xu, An Yan, Yaoyang Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive Experimental Validation: We validate the effectiveness of our proposed LLM-LNS at two levels. First, we test its agent's performance on heuristic generation tasks of combinatorial optimization problems, demonstrating its superiority over state-of-the-art methods such as FunSearch and EOH. Second, we evaluate its performance on large-scale MILP problems, where it outperforms traditional LNS methods, ML-based LNS methods, and leading solvers. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 2Beijing National Research Center for Information Science and Technology, Beijing 100084, China. Correspondence to: Hua Xu <EMAIL>. |
| Pseudocode | Yes | For the framework's detailed pseudocode, see Appendix B.1. Algorithm 1 Pseudocode of the Proposed LLM-LNS Framework |
| Open Source Code | Yes | The code of LLM-LNS is open-sourced at https://github.com/thuiar/LLM-LNS. |
| Open Datasets | Yes | To validate the effectiveness of the proposed LLM-LNS framework, we conduct two sets of experiments. First, we evaluate our proposed Dual-layer Self-evolutionary LLM Agent on heuristic generation tasks for combinatorial optimization problems. We focus on two widely studied problems: Online Bin Packing (Seiden, 2002) and the Traveling Salesman Problem (TSP) (Hoffman et al., 2013). ... To validate the effectiveness of the proposed LLM-LNS framework for large-scale MILP problems, we evaluate its performance on four widely-used benchmark datasets: Set Covering (SC) (Caprara et al., 2000), Minimum Vertex Cover (MVC) (Dinur & Safra, 2005), Maximum Independent Set (MIS) (Tarjan & Trojanowski, 1977), and Mixed Integer Knapsack Set (MIKS) (Atamtürk, 2003). ... We use a set of standard problem instances based on four canonical MILP problems: Maximum Independent Set (MIS), Minimum Vertex Cover (MVC), Set Covering (SC), and Mixed Integer Knapsack Set (MIKS). |
| Dataset Splits | Yes | For the TSP experiments, we used the same five TSPLib instances d198, eil76, rat99, rl1889, and u1060 as the training set for evolving our policies. None of the other TSPLib instances used in evaluation were included in training. ... For the Bin Packing task, we adopted the Weibull 5k test dataset as the training data, consistent with the setup in EOH and FunSearch. ... These generated instances are randomly partitioned into training and testing sets, preventing selection bias and ensuring a more reliable assessment of our framework's generalization capabilities across different problem types and scales. ... The small-scale training instances are designed with sizes corresponding to 1% of the decision variables and constraints of the large-scale testing instances. These small-scale problems are used to train and evolve heuristic and prompt strategies. The large-scale testing instances are significantly larger, featuring up to 10^6 decision variables and 3×10^6 constraints (as shown in Table 6). These instances are used to evaluate the generalization ability of the strategies evolved during training. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or cloud resources) were mentioned in the paper regarding the experimental setup. |
| Software Dependencies | Yes | Gurobi (Gurobi Optimization, LLC, 2023) and SCIP (Maher et al., 2016)... In the experiments, ReEvo exhibited poor stability when using GPT-4o-mini. ... We conducted experiments to evaluate the robustness of the LLM-LNS framework across various large language models, including GPT-4o, GPT-4o-mini, DeepSeek, Gemini-1.5-Pro, and Llama-3.1-70B. ... import numpy as np |
| Experiment Setup | Yes | C.1. Dual-layer Self-evolutionary LLM Agent Parameters The following key parameters were used for the evolutionary process of the LLM agent: h: Represents the number of top-performing heuristic strategies used to evaluate each prompt strategy. ... In our experiments, h is set to half of the population size. ... l: Denotes the number of top individuals in the heuristic population that are monitored for stagnation. ... In all our experiments, l is set to 4. t: The number of consecutive generations during which the top-l individuals must remain unchanged before stagnation is detected. In all our experiments, t is set to 3. C.2. Adaptive Large Neighborhood Search (ALNS) Parameters For ALNS, we use the following parameters: Neighborhood size k: Set to half of the decision variable count n. ... Time limit T: ... 100 seconds for problems with approximately 100K variables and 200 seconds for problems with around 1M variables. Threshold ϵ: ... We set ϵ = 1e-3. Iteration limit p: ... We set p = 3. Minimum and maximum neighborhood sizes kmin, kmax: ... kmin = 0 and kmax = n (the total number of decision variables in the problem). Adjustment rate u%: ... we set u% = 10. |
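The ALNS parameters quoted above (initial k = n/2, ϵ = 1e-3, p = 3, kmin = 0, kmax = n, u% = 10) describe an adaptive neighborhood-size rule. A minimal sketch of how such a rule could combine them is shown below; the parameter values come from Appendix C.2 of the paper, but the grow/shrink logic itself is an illustrative assumption, not the authors' released code.

```python
def adapt_neighborhood_size(k, rel_improvement, stagnant_iters,
                            n, k_min=0, k_max=None,
                            u=0.10, eps=1e-3, p=3):
    """Sketch of an adaptive neighborhood-size update for LNS.

    Defaults mirror Appendix C.2 (u% = 10, eps = 1e-3, p = 3,
    k_min = 0, k_max = n); the update logic is an assumption.
    """
    if k_max is None:
        k_max = n  # the paper sets kmax to the decision-variable count
    if rel_improvement < eps and stagnant_iters >= p:
        # Improvement below the eps threshold for p iterations:
        # enlarge the neighborhood by u% to escape stagnation.
        k = min(int(k * (1 + u)), k_max)
    elif rel_improvement >= eps:
        # Progress is being made: shrink the neighborhood by u%
        # so each sub-MIP stays cheap to solve.
        k = max(int(k * (1 - u)), k_min)
    return k

# Initial neighborhood size is half the decision-variable count.
n = 100_000
k = n // 2
k = adapt_neighborhood_size(k, rel_improvement=0.0, stagnant_iters=3, n=n)
```

The bounds kmin and kmax simply clamp the multiplicative update so the neighborhood never collapses to nothing or exceeds the full variable set.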