Preference-based Deep Reinforcement Learning for Historical Route Estimation

Authors: Boshen Pan, Yaoxin Wu, Zhiguang Cao, Yaqing Hou, Guangyu Zou, Qiang Zhang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the method aligns generated solutions more closely with human preferences. Moreover, it exhibits strong generalization performance across a variety of instances, offering a robust solution for different VRP scenarios. We evaluated the performance of the proposed preference-driven deep reinforcement learning framework on the classic CVRP. The instances used in the experiments were sourced from CVRPLIB. The experiments were conducted on a computer equipped with an Intel(R) Core(TM) i5-13400 2.5GHz CPU, 32.0GB RAM, and an NVIDIA GeForce RTX 4090 GPU, with model training and inference carried out using POMO [Kwon et al., 2020], which is widely regarded as a classic benchmark algorithm in the VRP field.
Researcher Affiliation | Academia | ¹School of Computer Science and Technology, Dalian University of Technology; ²Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology; ³School of Computing and Information Systems, Singapore Management University, Singapore; ⁴Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, China. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available: https://github.com/pandarking/Preference-based-DRL
Open Datasets | Yes | The instances used in the experiments were sourced from CVRPLIB.
Dataset Splits | No | Specifically, we selected a benchmark instance containing 73 customer nodes (excluding the depot) from CVRPLIB as the initial dataset. From this instance, we randomly sampled sub-instances containing 20 customer nodes and randomly assigned demand values to each customer node. Each training batch consisted of 50 sub-instances, ensuring the diversity of the data. For each sub-instance, we first calculated its absolute distance matrix D = [d_ij], consisting of pairwise distances between all available nodes. A random matrix E = [e_ij] was then introduced, with elements e_ij sampled from a uniform distribution U(0.8, 1.2). By element-wise multiplying the distance matrix D with E, we generated a preference matrix P = D ⊙ E to simulate human preferences in route selection. After obtaining the preference matrix P, we used a heuristic algorithm to generate an initial historical route for each sub-instance. To further simulate the diversity of human route selections in real-world scenarios, we introduced random perturbations to each initial route solution, thereby generating additional historical routes. Specifically, each sub-instance generated 30 historical routes, including the initial solution and its 29 randomly perturbed versions. These perturbed routes reflect possible changes in human preferences by adjusting the order of nodes and the structure of the route. Through this generation process, we constructed a training dataset with diverse and dynamic preferences, ensuring both the effectiveness of model training and a more realistic representation of the operational characteristics of CVRP instances. While the paper describes data generation and the use of 'unseen instances' for generalization, it does not explicitly provide fixed training/test/validation splits with percentages or sample counts for the CVRPLIB instances or the generated sub-instances.
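The data-generation procedure quoted above (pairwise distance matrix D, random factor matrix E with entries drawn from U(0.8, 1.2), element-wise preference matrix P = D ⊙ E, and 30 historical routes per sub-instance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the simple position-swap stands in for the paper's unspecified perturbation scheme.

```python
import math
import random

def build_preference_instance(coords, seed=0):
    """Distance matrix D, random factors E ~ U(0.8, 1.2), and
    preference matrix P = D * E (element-wise), as described."""
    rng = random.Random(seed)
    n = len(coords)
    D = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    E = [[rng.uniform(0.8, 1.2) for _ in range(n)] for _ in range(n)]
    P = [[D[i][j] * E[i][j] for j in range(n)] for i in range(n)]
    return D, P

def perturb_route(route, rng):
    """One random perturbation: swap two positions in the route
    (a simple stand-in for the paper's perturbation scheme)."""
    r = route[:]
    i, j = rng.sample(range(len(r)), 2)
    r[i], r[j] = r[j], r[i]
    return r

def historical_routes(initial_route, n_routes=30, seed=0):
    """Initial solution plus 29 randomly perturbed variants."""
    rng = random.Random(seed)
    return [initial_route] + [
        perturb_route(initial_route, rng) for _ in range(n_routes - 1)
    ]
```

In this sketch the initial route would come from a heuristic CVRP solver applied to P; only the matrix construction and perturbation steps are shown.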
Hardware Specification | Yes | The experiments were conducted on a computer equipped with an Intel(R) Core(TM) i5-13400 2.5GHz CPU, 32.0GB RAM, and an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies | No | Model training and inference were carried out using POMO [Kwon et al., 2020]. While POMO is a specific algorithm/framework, the paper does not provide version numbers for POMO or any other key software dependencies.
Experiment Setup | Yes | To evaluate the effect of the parameter β, we set its values to 0, 0.1, 0.9, and 1 while keeping the other configurations unchanged.
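The β ablation above amounts to a small sweep over four settings with all other configurations fixed. A minimal driver for such a sweep might look like this, where `evaluate` is a hypothetical callback standing in for training and evaluating the model at a fixed β (the paper, not this sketch, defines β's exact role in the objective):

```python
def run_beta_ablation(evaluate, betas=(0.0, 0.1, 0.9, 1.0)):
    """Evaluate each beta value from the paper's ablation while
    holding every other configuration fixed. `evaluate` is a
    hypothetical stand-in for a full train-and-evaluate run."""
    return {beta: evaluate(beta) for beta in betas}

# Usage with a dummy evaluator that just records the setting:
results = run_beta_ablation(lambda beta: {"beta": beta})
```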