Reward Translation via Reward Machine in Semi-Alignable MDPs
Authors: Yun Hua, Haosheng Chen, Wenhao Li, Bo Jin, Baoxiang Wang, Hongyuan Zha, Xiangfeng Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents experiments addressing two primary research questions: 1) Can the NRT framework effectively extract abstract alignments from semi-alignable MDPs across reinforcement learning domains? 2) Does transferred reward, based on abstract alignment, enhance efficiency and performance in target task training? To investigate these questions, we evaluate the NRT framework in two sparse reward settings: 3D Visual Navigation and MuJoCo, verifying both isomorphic and homomorphic reward machines. Experiments substantiate our approach's effectiveness in tasks under environments with semi-alignable MDPs. |
| Researcher Affiliation | Academia | 1Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China 2School of Computer Science and Technology, East China Normal University, Shanghai, China 3School of Software Engineering, Tongji University, Shanghai, China 4School of Data Science, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China 5Key Laboratory of Mathematics and Engineering Applications, East China Normal University, Shanghai, China. Correspondence to: Xiangfeng Wang <EMAIL>. |
| Pseudocode | No | The paper describes the Neural Reward Translation (NRT) framework and its components conceptually and mathematically. While it includes Python code snippets in Appendix A.3 as examples of LLM-generated code for specific functions, it does not present structured pseudocode or algorithm blocks for the overall NRT methodology. |
| Open Source Code | Yes | An early-stage version of the code is available at: https://github.com/hyyh28/reward translation. Note that the codebase is still under development and may lack full documentation or polish. |
| Open Datasets | Yes | In the 3D visual navigation environment, we selected the Sign task in MiniWorld (Chevalier-Boisvert et al., 2023) as the target task, with the Text-Sign task serving as the original task. We conducted experiments using MuJoCo environments, selecting HalfCheetah, Hopper, and Ant as target tasks... All target tasks follow the standard OpenAI Gym settings (Brockman et al., 2016). |
| Dataset Splits | No | The paper discusses training RL agents in simulation environments (NChain, MiniWorld, MuJoCo) and mentions 'standard OpenAI Gym settings (Brockman et al., 2016)'. However, it does not specify any explicit training/test/validation dataset splits, as these environments typically involve continuous interaction rather than pre-split datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper states, 'We use the state-of-the-art Proximal Policy Optimization (PPO) algorithm as the baseline and conduct an ablation study with three variants: PPO-RM, PPO-NRT(Reward), and PPO-NRT(RM+Reward).' and 'For this experiment, we utilized the DQN as a baseline'. It also cites 'OpenAI Gym (Brockman et al., 2016)'. However, it does not provide specific version numbers for these algorithms, frameworks, or any other software dependencies. |
| Experiment Setup | No | The paper states, 'We use the state-of-the-art Proximal Policy Optimization (PPO) algorithm as the baseline and conduct an ablation study with three variants: PPO-RM, PPO-NRT(Reward), and PPO-NRT(RM+Reward).' and in Appendix A.4.3, 'All target tasks employ the same settings as their respective OpenAI Gym versions (Brockman et al., 2016)'. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text or the provided appendix sections. |
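As noted in the Pseudocode row, the paper formalizes reward machines mathematically rather than as algorithm blocks. For background, a reward machine is commonly modeled as a Mealy-style automaton whose transitions are keyed by propositional event labels and emit scalar rewards. The sketch below is a minimal, hypothetical illustration of that standard formulation (the class, state names, and labels are invented for this example and are not the authors' NRT implementation):

```python
# Minimal reward-machine sketch: a Mealy-style automaton whose
# transitions map (current RM state, event label) to
# (next RM state, reward). Hypothetical example only.

class RewardMachine:
    def __init__(self, initial, transitions):
        # transitions: {(state, label): (next_state, reward)}
        self.initial = initial
        self.transitions = transitions

    def step(self, state, label):
        # Labels without an explicit transition self-loop with zero reward.
        return self.transitions.get((state, label), (state, 0.0))

# Toy two-stage task: "reach the sign, then reach the goal".
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "at_sign"): ("u1", 0.1),
        ("u1", "at_goal"): ("u_done", 1.0),
    },
)

state, total = rm.initial, 0.0
for label in ["none", "at_sign", "none", "at_goal"]:
    state, r = rm.step(state, label)
    total += r
print(state, total)  # u_done 1.1
```

In a sparse-reward setting such as the paper's Sign task, the shaped per-transition rewards emitted by the automaton are what give the agent intermediate learning signal.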