On Generalization Across Environments In Multi-Objective Reinforcement Learning

Authors: Jayden Teoh, Pradeep Varakantham, Peter Vamplew

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our baseline evaluations of state-of-the-art MORL algorithms on this benchmark reveal limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyzed the algorithmic complexities within current MORL approaches that could impede the transfer of performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments.
Researcher Affiliation | Academia | 1 Singapore Management University, 2 Federation University Australia
Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We make our code available at https://github.com/JaydenTeoh/MORL-Generalization. Notably, we open-source our software and release a comprehensive dataset derived from over 1,700 cumulative days of baseline evaluations across multiple SOTA algorithms.
Open Datasets | Yes | Notably, we open-source our software and release a comprehensive dataset derived from over 1,700 cumulative days of baseline evaluations across multiple SOTA algorithms. We adapted existing domains from MO-Gymnasium (Felten et al., 2023), a multi-objective extension of the Gymnasium library (Towers et al., 2024; Brockman et al., 2016), and introduced new ones, each with expressive parameters controlling environmental variations.
Dataset Splits | Yes | During training using domain randomization, after each episode concludes, the agent's start position and orientation, the number of lava blocks, the placement of the goals and lava blocks, and the reward weightages of the goals are all randomly set. When an agent has collected/visited a goal, the weightage of the goal in the state space is set to 0, to indicate that the reward corresponding to that goal has already been awarded. In MO-Super Mario Bros, despite visual similarities across levels, each observation provides sufficient information to determine the optimal action at every time step. For example, the locations of coins, enemies, and bricks are clearly visible. Moreover, since there are only a finite number of stages (32), the agent can deduce its current stage directly from its observations with enough training. Similarly, in MO-Lava Grid, the complete layout of lava and goals, along with the agent's position and orientation, is fully observable at each time step. Furthermore, as described later in Section F, we concatenate the reward weights for each goal with the agent's observation, ensuring that the current reward function is explicitly provided. For the continuous control domains like MO-Hopper, MO-Half Cheetah, and MO-Humanoid, context variations arise from changes in dynamics (e.g., gravity, friction), yet the agent's observations typically include only joint positions and velocities. Consequently, optimal actions cannot be inferred from a single time step. A similar limitation exists in the discrete domain MO-Lunar Lander, where the observations are typically restricted to orientation and velocity. The environment dynamics are, however, inferable when the agent considers its state-action history. Prior work has shown that history-based policies are effective in domains with changing dynamics (Yu et al., 2017; Peng et al., 2018; Tiboni et al., 2024).
Therefore, we adopt the standard approach of augmenting the state with a fixed-length history of past state-action pairs. In our main experiments for MO-Hopper, MO-Half Cheetah, MO-Humanoid, and MO-Lunar Lander, we use a history length of 2 so that the observed state at time t is a vector of the form: (s_{t-2}, a_{t-2}, s_{t-1}, a_{t-1}, s_t). For time steps t < 2, we repeat the initial state and pad missing actions with zeros.
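The history augmentation described above can be sketched as follows. This is an illustrative NumPy snippet, not code from the paper's repository; `augment_with_history` is a hypothetical helper name, and flat state/action vectors are assumed:

```python
import numpy as np

def augment_with_history(states, actions, t, history_len=2):
    """Build the history-augmented observation at time step t.

    states:  list of state vectors s_0 .. s_t
    actions: list of action vectors a_0 .. a_{t-1}
    Returns the concatenation (s_{t-H}, a_{t-H}, ..., s_{t-1}, a_{t-1}, s_t).
    For time steps before the history fills up, the initial state is
    repeated and missing actions are zero-padded, as described above.
    """
    zero_action = np.zeros_like(actions[0]) if actions else np.zeros(1)
    parts = []
    for k in range(t - history_len, t):
        parts.append(states[k] if k >= 0 else states[0])     # repeat s_0
        parts.append(actions[k] if k >= 0 else zero_action)  # zero-pad a_k
    parts.append(states[t])
    return np.concatenate(parts)
```

With a history length of 2, a d-dimensional state, and an m-dimensional action, the augmented observation has dimension 2(d + m) + d.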
Hardware Specification | Yes | Most of our experiments complete within 1–2 days, with all runs kept under five days on a single NVIDIA RTX A5000 GPU and a 48-core AMD EPYC 7643 CPU.
Software Dependencies | No | The paper mentions using MO-Gymnasium (Felten et al., 2023), Gymnasium (Towers et al., 2024; Brockman et al., 2016), CleanRL (Huang et al., 2022), and refers to a Nature CNN (Mnih et al., 2015), but it does not specify version numbers for these software components or other key dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Table 3 shows shared training hyperparameters across algorithms for each domain in the MORL generalization benchmark. The scripts to reproduce the results in this paper are provided in the codebase, along with more specific hyperparameters for the different algorithms. To ensure fair evaluations, we utilize the same architectures for the policy and value functions across all algorithms for each domain. Specifically, for MO-Lava Grid, MO-Lunar Lander, MO-Hopper, MO-Half Cheetah, and MO-Humanoid, the policy and value functions are multi-layer perceptrons (MLPs) with four hidden layers of 256 units each. For MO-Super Mario Bros, which has image observations, the policy and value functions consist of a Nature CNN (Mnih et al., 2015) followed by an MLP with two hidden layers of 512 units each. For off-policy algorithms that depend on experience replay, we ensure the same replay buffer size is used.
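The shared MLP architecture for the non-image domains (four hidden layers of 256 units) can be sketched as below. This is a minimal NumPy illustration of the stated layer sizes, not the authors' implementation; `init_mlp` and `mlp_forward` are hypothetical names, and the He-style initialization and ReLU activations are assumptions:

```python
import numpy as np

def init_mlp(obs_dim, out_dim, hidden=(256, 256, 256, 256), seed=0):
    """Initialize an MLP with four hidden layers of 256 units each,
    matching the shared architecture for the non-image domains."""
    rng = np.random.default_rng(seed)
    dims = (obs_dim, *hidden, out_dim)
    # He-style initialization (an assumption; the paper does not specify).
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU
    W, b = params[-1]
    return x @ W + b
```

The same network skeleton would serve as either policy head or value head, with only `out_dim` differing (action dimension vs. number of objectives).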