Preference-based Deep Reinforcement Learning for Historical Route Estimation
Authors: Boshen Pan, Yaoxin Wu, Zhiguang Cao, Yaqing Hou, Guangyu Zou, Qiang Zhang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the method aligns generated solutions more closely with human preferences. Moreover, it exhibits strong generalization performance across a variety of instances, offering a robust solution for different VRP scenarios. We evaluated the performance of the proposed preference-driven deep reinforcement learning framework on the classic CVRP. The instances used in the experiments were sourced from CVRPLIB. The experiments were conducted on a computer equipped with an Intel(R) Core(TM) i5-13400 2.5GHz CPU, 32.0GB RAM, and an NVIDIA GeForce RTX 4090 GPU, with model training and inference carried out using POMO [Kwon et al., 2020], which is widely regarded as a classic benchmark algorithm in the VRP field. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Dalian University of Technology; 2 Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology; 3 School of Computing and Information Systems, Singapore Management University, Singapore; 4 Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, China. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available: https://github.com/pandarking/Preference-based-DRL |
| Open Datasets | Yes | The instances used in the experiments were sourced from CVRPLIB. |
| Dataset Splits | No | Specifically, we selected a benchmark instance containing 73 customer nodes (excluding the depot) from CVRPLIB as the initial dataset. From this instance, we randomly sampled sub-instances containing 20 customer nodes and randomly assigned demand values to each customer node. Each training batch consisted of 50 sub-instances, ensuring the diversity of the data. For each sub-instance, we first calculated its absolute distance matrix D = [dij], consisting of pairwise distances between all available nodes. A random matrix E = [eij] was then introduced, with elements eij sampled from a uniform distribution U(0.8, 1.2). By element-wise multiplying the distance matrix D with E, we generated a preference matrix P = D ⊙ E to simulate human preferences in route selection. After obtaining the preference matrix P, we used a heuristic algorithm to solve an initial historical route for each sub-instance. To further simulate the diversity of human route selections in real-world scenarios, we introduced random perturbations to each initial route solution, thereby generating additional historical routes. Specifically, each sub-instance generated 30 historical routes, including the initial solution and its 29 randomly perturbed versions. These perturbed routes reflect possible changes in human preferences by adjusting the order of nodes and the structure of the route. Through this generation process, we constructed a training dataset with diverse and dynamic preferences, ensuring both the effectiveness of model training and a more realistic representation of the operational characteristics of CVRP instances. While the paper describes data generation and the use of 'unseen instances' for generalization, it does not explicitly provide fixed training/test/validation splits with percentages or sample counts for the CVRPLIB instances or the generated sub-instances. |
| Hardware Specification | Yes | The experiments were conducted on a computer equipped with an Intel(R) Core(TM) i5-13400 2.5GHz CPU, 32.0GB RAM, and an NVIDIA GeForce RTX 4090 GPU |
| Software Dependencies | No | model training and inference carried out using POMO [Kwon et al., 2020]. While POMO is a specific algorithm/framework, the paper does not provide a specific version number for POMO or any other key software dependencies. |
| Experiment Setup | Yes | To evaluate the effect of the parameter β, we set its values to 0, 0.1, 0.9, and 1 while keeping the other configurations unchanged. |
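The data-generation procedure quoted under Dataset Splits can be sketched in code: sample a 20-customer sub-instance, form the preference matrix P = D ⊙ E with E ~ U(0.8, 1.2), solve an initial route with a heuristic, and perturb it 29 times to obtain 30 historical routes. This is a minimal sketch, not the authors' implementation; the paper does not name its heuristic or perturbation operator, so a nearest-neighbor construction and random customer swaps are used here as stand-ins, and the coordinates are random placeholders for the 73-customer CVRPLIB instance.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sub_instance(coords, n_customers=20):
    """Sample a sub-instance: depot (node 0) plus n random customers."""
    idx = rng.choice(len(coords) - 1, size=n_customers, replace=False) + 1
    return np.vstack([coords[0], coords[idx]])

def preference_matrix(nodes):
    """P = D ⊙ E with D the pairwise distance matrix and E ~ U(0.8, 1.2)."""
    D = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    E = rng.uniform(0.8, 1.2, size=D.shape)
    return D * E

def nearest_neighbor_route(P):
    """Stand-in heuristic (the paper's heuristic is unspecified)."""
    route, unvisited = [0], set(range(1, len(P)))
    while unvisited:
        nxt = min(unvisited, key=lambda j: P[route[-1], j])
        route.append(nxt)
        unvisited.remove(nxt)
    return route

def perturb(route, n_swaps=2):
    """Random customer swaps to mimic varied human route choices."""
    r = route.copy()
    for _ in range(n_swaps):
        i, j = rng.choice(range(1, len(r)), size=2, replace=False)
        r[i], r[j] = r[j], r[i]
    return r

coords = rng.uniform(0, 1, size=(74, 2))   # placeholder for the CVRPLIB instance
nodes = make_sub_instance(coords)
P = preference_matrix(nodes)
initial = nearest_neighbor_route(P)
histories = [initial] + [perturb(initial) for _ in range(29)]
print(len(histories), len(histories[0]))   # 30 historical routes over 21 nodes
```

Repeating this per sub-instance (50 per batch, per the quoted text) yields the diverse preference-labeled training data the paper describes.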