Reward Translation via Reward Machine in Semi-Alignable MDPs

Authors: Yun Hua, Haosheng Chen, Wenhao Li, Bo Jin, Baoxiang Wang, Hongyuan Zha, Xiangfeng Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents experiments addressing two primary research questions: 1) Can the NRT framework effectively extract abstract alignments from semi-alignable MDPs across reinforcement learning domains? 2) Does the transferred reward, based on abstract alignment, enhance efficiency and performance in target-task training? To investigate these questions, we evaluate the NRT framework in two sparse-reward settings, 3D Visual Navigation and MuJoCo, verifying both isomorphic and homomorphic reward machines. Experiments substantiate our approach's effectiveness in tasks under environments with semi-alignable MDPs.
Researcher Affiliation | Academia | 1Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China 2School of Computer Science and Technology, East China Normal University, Shanghai, China 3School of Software Engineering, Tongji University, Shanghai, China 4School of Data Science, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China 5Key Laboratory of Mathematics and Engineering Applications, East China Normal University, Shanghai, China. Correspondence to: Xiangfeng Wang <EMAIL>.
Pseudocode | No | The paper describes the Neural Reward Translation (NRT) framework and its components conceptually and mathematically. While it includes Python code snippets in Appendix A.3 as examples of LLM-generated code for specific functions, it does not present structured pseudocode or algorithm blocks for the overall NRT methodology.
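Since the paper relies on reward machines but provides no pseudocode, the following is a minimal, generic reward-machine sketch in the standard finite-state-machine style. This is illustrative only, not the authors' NRT implementation; the class name `RewardMachine`, its methods, and the toy "sign" task states and labels are all assumptions.

```python
# Minimal reward machine: a finite-state machine over high-level
# propositions ("labels") that emits a reward on each transition.
# Illustrative sketch only; names and the toy task are assumptions.

class RewardMachine:
    def __init__(self, u0, delta_u, delta_r, terminal):
        self.u = u0                # current RM state
        self.delta_u = delta_u     # (state, label) -> next state
        self.delta_r = delta_r     # (state, label) -> reward
        self.terminal = terminal   # set of absorbing states

    def step(self, label):
        """Advance on an observed label; return the RM reward."""
        key = (self.u, label)
        reward = self.delta_r.get(key, 0.0)
        self.u = self.delta_u.get(key, self.u)
        return reward

    def done(self):
        return self.u in self.terminal


# Toy task: first see the sign (u0 -> u1), then touch it (u1 -> u2).
rm = RewardMachine(
    u0="u0",
    delta_u={("u0", "saw_sign"): "u1", ("u1", "touched_sign"): "u2"},
    delta_r={("u0", "saw_sign"): 0.1, ("u1", "touched_sign"): 1.0},
    terminal={"u2"},
)
rewards = [rm.step(label) for label in ["saw_sign", "touched_sign"]]
```

In this framing, an isomorphic reward machine maps states and transitions one-to-one between source and target tasks, while a homomorphic one allows several source transitions to collapse onto one target transition.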
Open Source Code | Yes | An early-stage version of the code is available at: https://github.com/hyyh28/reward translation. Note that the codebase is still under development and may lack full documentation or polish.
Open Datasets | Yes | In the 3D visual navigation environment, we selected the Sign task in MiniWorld (Chevalier-Boisvert et al., 2023) as the target task, with the Text-Sign task serving as the original task. We conducted experiments using MuJoCo environments, selecting HalfCheetah, Hopper, and Ant as target tasks... All target tasks follow the standard OpenAI Gym settings (Brockman et al., 2016).
Dataset Splits | No | The paper discusses training RL agents in simulation environments (NChain, MiniWorld, MuJoCo) and mentions 'standard OpenAI Gym settings (Brockman et al., 2016)'. However, it does not specify any explicit training/validation/test splits, as these environments involve continuous interaction rather than pre-split datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | The paper states, 'We use the state-of-the-art Proximal Policy Optimization (PPO) algorithm as the baseline and conduct an ablation study with three variants: PPO-RM, PPO-NRT(Reward), and PPO-NRT(RM+Reward),' and 'For this experiment, we utilized the DQN as a baseline.' It also cites OpenAI Gym (Brockman et al., 2016). However, it does not provide specific version numbers for these algorithms, frameworks, or any other software dependencies.
Experiment Setup | No | The paper states, 'We use the state-of-the-art Proximal Policy Optimization (PPO) algorithm as the baseline and conduct an ablation study with three variants: PPO-RM, PPO-NRT(Reward), and PPO-NRT(RM+Reward),' and in Appendix A.4.3, 'All target tasks employ the same settings as their respective OpenAI Gym versions (Brockman et al., 2016).' However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text or the provided appendix sections.
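For intuition about the PPO-NRT(Reward) variant described above, where a sparse target-task reward is augmented with a reward translated from the aligned source task, the following is a minimal shaping sketch. The additive combination rule and the weight `beta` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: augmenting a sparse environment reward with a
# translated reward from an (assumed) aligned source-task reward
# machine. The additive rule and weight `beta` are assumptions.

def shaped_reward(env_reward, translated_reward, beta=0.5):
    """Combine the target task's sparse reward with the transferred one."""
    return env_reward + beta * translated_reward


# Example: the env reward stays 0 until success, while the translated
# reward supplies intermediate signal along the way.
trajectory = [(0.0, 0.2), (0.0, 0.5), (1.0, 1.0)]  # (env_r, translated_r)
returns = [shaped_reward(e, t) for e, t in trajectory]
```

A training loop would feed `shaped_reward` to PPO in place of the raw environment reward; setting `beta = 0` recovers the plain PPO baseline.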