Reward Dimension Reduction for Scalable Multi-Objective Reinforcement Learning

Authors: Giseung Park, Youngchul Sung

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we introduce a simple yet effective reward dimension reduction method to tackle the scalability challenges of multi-objective reinforcement learning algorithms. While most existing approaches focus on optimizing two to four objectives, their abilities to scale to environments with more objectives remain uncertain. Our method uses a dimension reduction approach to enhance learning efficiency and policy performance in multi-objective settings. While most traditional dimension reduction methods are designed for static datasets, our approach is tailored for online learning and preserves Pareto-optimality after transformation. We propose a new training and evaluation framework for reward dimension reduction in multi-objective reinforcement learning and demonstrate the superiority of our method in environments including one with sixteen objectives, significantly outperforming existing online dimension reduction methods.
Researcher Affiliation Academia Giseung Park, Youngchul Sung, School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea, EMAIL
Pseudocode No The paper describes the method in prose and equations rather than providing structured pseudocode.
Open Source Code Yes The link to our code is https://github.com/Giseung-Park/Dimension-Reduction-MORL.
Open Datasets Yes To address this issue, we considered the following two MORL environments: Lunar Lander-5D (Hung et al., 2023) and our modified implementation of an existing traffic light control environment (Alegre, 2019) to create a sixteen-dimensional reward setting.
Dataset Splits No The paper does not explicitly provide details on training/test/validation splits for the datasets, but rather describes how preference vectors were sampled for evaluation: "For evaluation, we generated fifteen and thirty five equidistant points on the simplex for Lunar Lander and the traffic environment, respectively." It also mentions a trimmed mean process for statistical reliability: "we applied a 12.5% trimmed mean by excluding the seeds with maximum and minimum hypervolume values over eight random seeds and reporting the averages of the metrics over the remaining six random seeds."
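The trimmed-mean aggregation quoted above can be sketched as follows; the function name and list-based interface are our own illustration, not the paper's code:

```python
# Sketch of the 12.5% trimmed mean described in the paper: over eight seeds,
# drop the seeds with the maximum and minimum hypervolume, then average the
# metric over the remaining six seeds.
def trimmed_mean_over_seeds(values):
    """values: per-seed metric values (length 8 in the paper's setup)."""
    ordered = sorted(values)
    kept = ordered[1:-1]  # exclude the single min and single max seed
    return sum(kept) / len(kept)
```

With eight seeds, dropping one seed at each extreme removes 2/8 = 25% of the data symmetrically, i.e. 12.5% from each tail, matching the paper's description.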
Hardware Specification Yes We use infrastructures of Intel Xeon Gold 6238R CPU @ 2.20GHz and Intel Core i9-10900X CPU @ 3.70GHz.
Software Dependencies No For our implementation, we adapted morl-baselines (Felten et al., 2023) and integrated it with sumo-rl (Alegre, 2019), a toolkit designed for traffic light control simulations, as discussed in Section 5. For Lunar Lander-5D, we used morl-baselines (Felten et al., 2023) with the reward function provided by the source code of Hung et al. (2023). Our implementation in PyTorch (Paszke et al., 2019) effectively applies this parameterization, and we solve the optimization in equation 11 using stochastic gradient descent in an online setting. We train Qθ using the Adam optimizer (Kingma & Ba, 2015). The paper mentions PyTorch and the Adam optimizer, but it does not specify their version numbers.
Experiment Setup Yes For our proposed method and the baselines, we set the discount factor γ = 0.99 and use a buffer size of 52,000 and 1M for traffic and Lunar Lander, respectively. In the Base algorithm (Yang et al., 2019), we utilize a multi-objective action-value network Qθ with an input size of the observation dimension plus K, two hidden layers of 128 (Lunar Lander) / 256 (traffic) units each, and ReLU activations after each hidden layer. The output layer has a size of |A| · K. For the dimension reduction methods, the Qθ network has an input size of the observation dimension plus m, two hidden layers of 128 (Lunar Lander) / 256 (traffic) units with ReLU activations, and an output layer of size |A| · m. We train Qθ using the Adam optimizer (Kingma & Ba, 2015), applying the loss function after the first 200 timesteps, with a learning rate of 0.0003 and a minibatch size of 32. Exploration follows an ϵ-greedy strategy, with ϵ linearly decaying from 1.0 to 0.05 over the first 10% of the total timesteps. The target network is updated every 500 timesteps. We update θ using the gradient ∇θL(θ), where L(θ) = (1 − λ)Lmain(θ) + λLaux(θ), Lmain(θ) is the primary loss, and Laux(θ) is the auxiliary loss in Yang et al. (2019). The weight λ is linearly scheduled from 0 to 1 over the first 75% and 25% of the total timesteps in traffic and Lunar Lander, respectively. Sampling preference vectors ωm ∈ ∆m during training and execution follows the uniform Dirichlet distribution. For the three online dimension reduction methods (our approach, the autoencoder, and our implementation of online NPCA), we utilize the Adam optimizer for updates. In our method, the matrix A is initialized with each entry set to 1/K. The neural network gϕ has an input dimension of m, two hidden layers of 32 units each, and ReLU activations after each hidden layer. The output layer has a size of K. We use a dropout rate of 0.75 and 0.25 in traffic and Lunar Lander, respectively (with 0 meaning no dropout).
Equation 11 is optimized with a learning rate of 0.0003 and an update interval of 5 timesteps. For the autoencoder, the encoder network has an input size of K, two hidden layers with 32 units each, and ReLU activations after each hidden layer. The output layer has a size of m. The decoder follows the same architecture as gϕ, but without dropout. The reward reconstruction loss is optimized with a learning rate of 0.0001 and an update interval of 20 timesteps. For the online NPCA, we use ReLU parameterization for efficient learning (also implemented in PyTorch (Paszke et al., 2019)) to meet the constraint on matrix U. The matrix U is initialized similarly with each entry set to 1/K. NPCA is optimized with a learning rate of 0.0001, an update interval of 20 (traffic) / 50 (Lunar Lander) timesteps, and β = 50000 (traffic) / 1000 (Lunar Lander). The reduced vector representation of r is U^T(r − µ) ∈ R^m, following the PCA assumption that the transformed vectors are centered (Zass & Shashua, 2006; Cardot & Degras, 2018). For NPCA-ortho in traffic, increasing the value of β did not yield better orthonormality, so we set the update interval to 5 timesteps, keeping the same β value. For incremental PCA, we recursively update the sample mean vector of rewards as µ_{t+1} = (t/(t+1)) µ_t + (1/(t+1)) r_{t+1} ∈ R^K and the sample covariance matrix as C_{t+1} = (t/(t+1)) C_t + (t/(t+1)^2) (r_{t+1} − µ_t)(r_{t+1} − µ_t)^T ∈ R^{K×K} for each timestep t (Cardot & Degras, 2018). Every 20 timesteps, we eigen-decompose the covariance matrix, selecting the top m eigenvectors u_1, . . . , u_m ∈ R^K corresponding to the largest eigenvalues, and update U = [u_1, . . . , u_m] ∈ R^{K×m}. The reduced vector representation of r is U^T(r − µ) ∈ R^m, assuming the vectors are centered (Cardot & Degras, 2018). U is initialized as a matrix with each entry set to 1/K. For evaluation, we generated fifteen and thirty-five equidistant points on the simplex for Lunar Lander and the traffic environment, respectively.
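The incremental PCA bookkeeping described above (recursive mean/covariance updates plus a periodic eigendecomposition to refresh U) can be sketched in NumPy; variable names mirror the text, but this is our illustration under those formulas, not the authors' implementation:

```python
import numpy as np

# Sketch of the incremental PCA baseline: per-timestep recursive updates of
# the reward mean mu and covariance C, with the projection U refreshed from
# the top-m eigenvectors of C every fixed interval (every 20 timesteps in
# the paper's setup).
def update_stats(mu_t, C_t, r_next, t):
    """One-step update of the running reward mean and covariance (t >= 1)."""
    diff = r_next - mu_t
    mu_next = (t / (t + 1)) * mu_t + (1 / (t + 1)) * r_next
    C_next = (t / (t + 1)) * C_t + (t / (t + 1) ** 2) * np.outer(diff, diff)
    return mu_next, C_next

def top_m_projection(C, m):
    """Return U = [u_1, ..., u_m], the top-m eigenvectors of C (K x m)."""
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:m]]

def reduce_reward(r, mu, U):
    """Reduced representation U^T (r - mu) in R^m, assuming centered data."""
    return U.T @ (r - mu)
```

For two observed rewards r_1, r_2 (initializing mu_1 = r_1, C_1 = 0), one update step yields the sample mean (r_1 + r_2)/2 and the population covariance of the pair, consistent with the recursive formulas above.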