ARS: Adaptive Reward Scaling for Multi-Task Reinforcement Learning
Authors: Myungsik Cho, Jongeui Park, Jeonghye Kim, Youngchul Sung
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications. |
| Researcher Affiliation | Academia | 1School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea. Correspondence to: Youngchul Sung <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive Reward Scaling (ARS). 1: Initialize policy network π_θ and Q-value network Q_ψ. 2: Initialize replay buffer D_i for each task T_i ∈ C. 3: for t = 1, ..., T_init do 4: for all T_i ∈ C do 5: Interact with the environment of T_i with a random policy and store data in D_i 6: end for 7: end for 8: Initialize the reward scaling factors {c_i^rew}_{i=1}^N using (6). 9: for t = T_init + 1, ... do 10: for all T_i ∈ C do 11: Interact with the environment of T_i with π_θ and store data in D_i 12: end for 13: Update θ and ψ using the data in {D_i}_{i=1}^N and the scaling factors {c_i^rew}_{i=1}^N 14: if t % T_reset == 0 then 15: Update {c_i^rew}_{i=1}^N using (6) 16: Randomly reinitialize θ and ψ 17: end if 18: end for |
| Open Source Code | No | The paper does not contain any explicit statements about code availability, nor does it provide links to any code repositories. |
| Open Datasets | Yes | To evaluate the effectiveness of the proposed method on various tasks, we conducted experiments using the Meta-World benchmark (Yu et al., 2019), which includes 50 distinct robotic control tasks involving a Sawyer arm in the MuJoCo environment (Todorov et al., 2012). Our experiments used two setups: MT10 and MT50, which consist of 10 and 50 manipulation tasks, respectively. A detailed description of the benchmarks is provided in Appendix A. |
| Dataset Splits | No | The paper describes the evaluation methodology for reinforcement learning tasks, stating: "Policy evaluation is based on the success ratio across all tasks, where the success ratio for a specific task is determined by averaging its success rate over 10 episodes with different sampled goals." and "each task is trained with randomly sampled goal positions and evaluated across 10 randomized goal configurations...". However, it does not provide explicit training/test/validation splits for a fixed dataset, which is typical for static supervised learning scenarios. Instead, data is collected through interaction with dynamic environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions "optimizer Adam (Kingma & Ba, 2015)" but does not specify version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | In this section, we provide the hyperparameters for ARS used in the MT10 and MT50 experiments in Table 7, along with some general hyperparameters used across ARS and the baselines in Table 8. Table 8. General multi-task RL hyperparameters (MT10 / MT50): training steps 2×10^7 / 1×10^8; number of resets (n_reset) 4 / 6; replay buffer size per task 1×10^6 / 5×10^5; episode length 500; optimizer Adam (Kingma & Ba, 2015); batch size per task 100; learning rate (all networks) 3e-4; critic activation Tanh; actor activation ReLU; discount factor (γ) 0.99; MLP hidden layer sizes [400, 400, 400, 400]; target network update period 1; tau (τ) 5e-3. |
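The control flow of Algorithm 1 (warm-up data collection, scaling-factor initialization, training with scaled rewards, and periodic reset) can be sketched as follows. This is a minimal illustrative sketch: the environments, the "gradient step", and the `reward_scale` rule standing in for Eq. (6) are all hypothetical placeholders, not the authors' implementation.

```python
import random

def reward_scale(avg_returns):
    # Placeholder for Eq. (6): scale each task inversely to the magnitude of
    # its average return so that tasks contribute on a comparable scale.
    return [1.0 / max(abs(r), 1e-8) for r in avg_returns]

def ars_sketch(num_tasks=3, t_init=5, t_total=20, t_reset=10, seed=0):
    rng = random.Random(seed)
    buffers = [[] for _ in range(num_tasks)]  # replay buffer D_i per task
    theta = rng.random()                      # stand-in for policy params (psi elided)

    # Lines 3-7: warm-up with a random policy for T_init steps per task.
    for _ in range(t_init):
        for i in range(num_tasks):
            buffers[i].append(rng.gauss(i + 1.0, 0.1))  # fake per-task reward

    # Line 8: initialize scaling factors from the warm-up data.
    scales = reward_scale([sum(b) / len(b) for b in buffers])

    resets = 0
    for t in range(t_init + 1, t_total + 1):
        # Lines 10-12: collect data with the current policy.
        for i in range(num_tasks):
            buffers[i].append(rng.gauss(i + 1.0, 0.1))
        # Line 13: update parameters using scaled rewards (real gradient step elided).
        theta += 1e-3 * sum(s * b[-1] for s, b in zip(scales, buffers))
        # Lines 14-17: periodically refresh the scales and reinitialize the networks.
        if t % t_reset == 0:
            scales = reward_scale([sum(b) / len(b) for b in buffers])
            theta = rng.random()
            resets += 1
    return scales, resets

scales, resets = ars_sketch()
print(len(scales), resets)
```

The sketch preserves the structure the pseudocode describes: scaling factors are recomputed only at reset boundaries (every `t_reset` steps), at which point the parameters are also reinitialized.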
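For readability, the Table 8 hyperparameters can be collected into a small config mapping. The values below are from the paper; the dictionary structure and the derived reset interval (total steps divided by the number of resets) are assumptions for illustration only.

```python
# Hypothetical config mirroring Table 8 (MT10 / MT50 values from the paper).
GENERAL_HPARAMS = {
    "training_steps": {"MT10": 2 * 10**7, "MT50": 1 * 10**8},
    "num_resets": {"MT10": 4, "MT50": 6},
    "replay_buffer_size_per_task": {"MT10": 1 * 10**6, "MT50": 5 * 10**5},
    "episode_length": 500,
    "optimizer": "Adam",
    "batch_size_per_task": 100,
    "learning_rate": 3e-4,
    "critic_activation": "Tanh",
    "actor_activation": "ReLU",
    "discount_factor": 0.99,
    "mlp_hidden_sizes": [400, 400, 400, 400],
    "target_update_period": 1,
    "tau": 5e-3,
}

def reset_interval(setup):
    # Assumed relationship: if n_reset resets are spread evenly over training,
    # the interval between resets is total steps // number of resets.
    return (GENERAL_HPARAMS["training_steps"][setup]
            // GENERAL_HPARAMS["num_resets"][setup])

print(reset_interval("MT10"))  # 5000000
```

Note that the paper itself does not state how the reset period relates to `n_reset`; the even-spacing assumption above is only one plausible reading.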