Reinforcement Learning from Bagged Reward

Authors: Yuting Tang, Xin-Qiang Cai, Yao-Xiang Ding, Qiyu Wu, Guoqing Liu, Masashi Sugiyama

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiment section, we scrutinize the efficacy of our proposed method using benchmark tasks from both the MuJoCo (Brockman et al., 2016) and the DeepMind Control Suite (Tassa et al., 2018) environments, focusing on scenarios with bagged rewards. First, we assess the performance of our method to understand its overall effectiveness. Next, we examine whether the proposed RBT reward model accurately predicts rewards. Finally, we evaluate whether each component of the reward model is essential.
Researcher Affiliation | Collaboration | Yuting Tang (EMAIL), The University of Tokyo & RIKEN Center for AIP, Tokyo, Japan; Xin-Qiang Cai (EMAIL), RIKEN Center for AIP, Tokyo, Japan; Yao-Xiang Ding (EMAIL), State Key Lab for CAD & CG, Zhejiang University, Hangzhou, China; Qiyu Wu (EMAIL), The University of Tokyo, Tokyo, Japan; Guoqing Liu (EMAIL), Microsoft Research AI4Science, Beijing, China; Masashi Sugiyama (EMAIL), RIKEN Center for AIP & The University of Tokyo, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Policy Optimization with RBT
1: Initialize replay buffer D and RBT parameters θ.
2: for each trajectory τ collected from the environment do
3:   Store trajectory τ with bag information {(B_{i,n_i}, R(B_{i,n_i}))} for B_{i,n_i} ∈ B_τ in D.
4:   Sample batches from D.
5:   Estimate the bag loss based on Eq. 7.
6:   Update RBT parameters θ based on the loss.
7:   Relabel rewards in D using the updated RBT.
8:   Optimize the policy on the relabeled data with an off-the-shelf RL algorithm (e.g., SAC (Haarnoja et al., 2018)).
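The bag-loss and relabeling steps of Algorithm 1 can be illustrated with a minimal sketch. A simple linear scorer stands in for the RBT (the paper uses a Transformer trained by gradient descent), and all function names here are hypothetical, not taken from the released code. The loss follows the structure described in the algorithm: a squared error between the sum of predicted per-step rewards in a bag and that bag's observed reward.

```python
# Minimal sketch of bag-level reward learning and relabeling.
# A linear scorer stands in for the RBT reward model; names are illustrative.

def predict_rewards(weights, states):
    """Predict a per-step reward for each state (here: a dot product)."""
    return [sum(w * x for w, x in zip(weights, state)) for state in states]

def bag_loss(weights, bags):
    """Mean squared error between summed per-step predictions and each bag's reward."""
    loss = 0.0
    for states, bag_reward in bags:
        pred_sum = sum(predict_rewards(weights, states))
        loss += (pred_sum - bag_reward) ** 2
    return loss / len(bags)

def relabel(weights, trajectory):
    """Replace stored rewards with the model's per-step predictions (step 7)."""
    states = [s for s, _ in trajectory]
    preds = predict_rewards(weights, states)
    return [(s, r_hat) for (s, _), r_hat in zip(trajectory, preds)]
```

After relabeling, the replay buffer holds dense per-step rewards, so a standard off-the-shelf RL algorithm such as SAC can consume it unchanged.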
Open Source Code | Yes | Code is available at https://github.com/Tang-Yuting/RLBR.
Open Datasets | Yes | We evaluated our method on benchmark tasks from the MuJoCo locomotion suite (Ant-v2, Hopper-v2, HalfCheetah-v2, and Walker2d-v2) and the DeepMind Control Suite (cheetah-run, quadruped-walk, fish-upright, cartpole-swingup, ball_in_cup-catch, and reacher-hard).
Dataset Splits | No | The paper describes the number of environment steps collected for training (1e6) and the maximum episode length (1,000 steps), and reports results averaged over 6 random seeds. However, it does not provide fixed training/validation/test splits; this is typical in reinforcement learning, where data is generated through interaction with the environment rather than drawn from a static pre-existing dataset.
Hardware Specification | Yes | The computational resources for these procedures were NVIDIA GeForce RTX 2080 Ti GPU clusters with 8GB of memory, dedicated to training and evaluation tasks.
Software Dependencies | Yes | We used MuJoCo version 2.0 for our simulations, which is available at http://www.mujoco.org/.
Experiment Setup | Yes | Table 3: Hyper-parameters of RBT.

| Hyper-parameter | Value |
| --- | --- |
| Number of Causal Transformer layers | 3 |
| Number of bidirectional attention layers | 1 |
| Number of attention heads | 4 |
| Embedding dimension | 256 |
| Batch size | 64 |
| Dropout rate | 0.1 |
| Learning rate | 0.0001 |
| Optimizer | AdamW (Loshchilov & Hutter, 2018) |
| Weight decay | 0.0001 |
| Warmup steps | 100 |
| Total gradient steps | 10000 |