Reinforcement Learning from Bagged Reward

Authors: Yuting Tang, Xin-Qiang Cai, Yao-Xiang Ding, Qiyu Wu, Guoqing Liu, Masashi Sugiyama

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiment section, we scrutinize the efficacy of our proposed method using benchmark tasks from both the MuJoCo (Brockman et al., 2016) and the DeepMind Control Suite (Tassa et al., 2018) environments, focusing on scenarios with bagged rewards. First, we assess the performance of our method to understand its overall effectiveness. Next, we examine whether the proposed RBT reward model accurately predicts rewards. Finally, we evaluate whether each component of the reward model is essential.
Researcher Affiliation | Collaboration | Yuting Tang (EMAIL), The University of Tokyo & RIKEN Center for AIP, Tokyo, Japan; Xin-Qiang Cai (EMAIL), RIKEN Center for AIP, Tokyo, Japan; Yao-Xiang Ding (EMAIL), State Key Lab for CAD & CG, Zhejiang University, Hangzhou, China; Qiyu Wu (EMAIL), The University of Tokyo, Tokyo, Japan; Guoqing Liu (EMAIL), Microsoft Research AI4Science, Beijing, China; Masashi Sugiyama (EMAIL), RIKEN Center for AIP & The University of Tokyo, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Policy Optimization with RBT
1: Initialize replay buffer D and RBT parameters θ.
2: for each trajectory τ collected from the environment do
3:   Store trajectory τ with bag information {(B_{i,n_i}, R(B_{i,n_i}))} for B_{i,n_i} ∈ B_τ in D.
4:   Sample batches from D.
5:   Estimate the bag loss based on Eq. 7.
6:   Update RBT parameters θ based on the loss.
7:   Relabel rewards in D using the updated RBT.
8:   Optimize the policy on the relabeled data with an off-the-shelf RL algorithm (e.g., SAC (Haarnoja et al., 2018)).
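The bag-loss and relabeling steps of Algorithm 1 can be illustrated with a minimal sketch. A simple linear scorer stands in for the RBT (the paper uses a Transformer trained by gradient descent), and all function names here are hypothetical, not taken from the released code. The loss follows the structure described in the algorithm: a squared error between the sum of predicted per-step rewards in a bag and that bag's observed reward.

```python
# Minimal sketch of bag-level reward learning and relabeling.
# A linear scorer stands in for the RBT reward model; names are illustrative.

def predict_rewards(weights, states):
    """Predict a per-step reward for each state (here: a dot product)."""
    return [sum(w * x for w, x in zip(weights, state)) for state in states]

def bag_loss(weights, bags):
    """Mean squared error between summed per-step predictions and each bag's reward."""
    loss = 0.0
    for states, bag_reward in bags:
        pred_sum = sum(predict_rewards(weights, states))
        loss += (pred_sum - bag_reward) ** 2
    return loss / len(bags)

def relabel(weights, trajectory):
    """Replace stored rewards with the model's per-step predictions (step 7)."""
    states = [s for s, _ in trajectory]
    preds = predict_rewards(weights, states)
    return [(s, r_hat) for (s, _), r_hat in zip(trajectory, preds)]
```

After relabeling, the replay buffer holds dense per-step rewards, so a standard off-the-shelf RL algorithm such as SAC can consume it unchanged.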
Open Source Code | Yes | Code is available at https://github.com/Tang-Yuting/RLBR.
Open Datasets | Yes | We evaluated our method on benchmark tasks from the MuJoCo locomotion suite (Ant-v2, Hopper-v2, HalfCheetah-v2, and Walker2d-v2) and the DeepMind Control Suite (cheetah-run, quadruped-walk, fish-upright, cartpole-swingup, ball_in_cup-catch, and reacher-hard).
Dataset Splits | No | The paper describes the number of environment steps collected for training (1e6) and the maximum episode length (1,000 steps), and reports results averaged over 6 random seeds. However, it does not provide fixed training/validation/test splits; this is typical in reinforcement learning, where data is generated through interaction with the environment rather than drawn from a static pre-existing dataset.
Hardware Specification | Yes | The computational resources for these procedures were NVIDIA GeForce RTX 2080 Ti GPU clusters with 8GB of memory, dedicated to training and evaluation tasks.
Software Dependencies | Yes | We used MuJoCo version 2.0 for our simulations, which is available at http://www.mujoco.org/.
Experiment Setup | Yes | Table 3: Hyper-parameters of RBT.

| Hyper-parameter | Value |
| --- | --- |
| Number of Causal Transformer layers | 3 |
| Number of bidirectional attention layers | 1 |
| Number of attention heads | 4 |
| Embedding dimension | 256 |
| Batch size | 64 |
| Dropout rate | 0.1 |
| Learning rate | 0.0001 |
| Optimizer | AdamW (Loshchilov & Hutter, 2018) |
| Weight decay | 0.0001 |
| Warmup steps | 100 |
| Total gradient steps | 10000 |