ARS: Adaptive Reward Scaling for Multi-Task Reinforcement Learning
Authors: Myungsik Cho, Jongeui Park, Jeonghye Kim, Youngchul Sung
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications. |
| Researcher Affiliation | Academia | 1School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea. Correspondence to: Youngchul Sung <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive Reward Scaling (ARS). 1: Initialize policy network π_θ and Q-value network Q_ψ. 2: Initialize replay buffer D_i for each task T_i ∈ C. 3: for t = 1, ..., T_init do 4: for all T_i ∈ C do 5: Interact with the environment of T_i with a random policy and store data in D_i 6: end for 7: end for 8: Initialize the reward scaling factors {c_i^rew}_{i=1}^N using (6). 9: for t = T_init + 1, ... do 10: for all T_i ∈ C do 11: Interact with the environment of T_i with π_θ and store data in D_i 12: end for 13: Update θ and ψ using the data in {D_i}_{i=1}^N and the scaling factors {c_i^rew}_{i=1}^N 14: if t % T_reset == 0 then 15: Update {c_i^rew}_{i=1}^N using (6) 16: Randomly reinitialize θ and ψ 17: end if 18: end for |
| Open Source Code | No | The paper does not contain any explicit statements about code availability, nor does it provide links to any code repositories. |
| Open Datasets | Yes | To evaluate the effectiveness of the proposed method on various tasks, we conducted experiments using the Meta-World benchmark (Yu et al., 2019), which includes 50 distinct robotic control tasks involving a Sawyer arm in the MuJoCo environment (Todorov et al., 2012). Our experiments used two setups: MT10 and MT50, which consist of 10 and 50 manipulation tasks, respectively. A detailed description of the benchmarks is provided in Appendix A. |
| Dataset Splits | No | The paper describes the evaluation methodology for reinforcement learning tasks, stating: "Policy evaluation is based on the success ratio across all tasks, where the success ratio for a specific task is determined by averaging its success rate over 10 episodes with different sampled goals." and "each task is trained with randomly sampled goal positions and evaluated across 10 randomized goal configurations...". However, it does not provide explicit training/test/validation splits for a fixed dataset, which is typical for static supervised learning scenarios. Instead, data is collected through interaction with dynamic environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions "optimizer Adam (Kingma & Ba, 2015)" but does not specify version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | In this section, we provide the hyperparameters for ARS used in the MT10 and MT50 experiments in Table 7, along with some general hyperparameters used across ARS and the baselines in Table 8. Table 8. General multi-task RL hyperparameters (MT10 / MT50): training steps 2×10^7 / 1×10^8; number of resets (n_reset) 4 / 6; replay buffer size per task 1×10^6 / 5×10^5; episode length 500; optimizer Adam (Kingma & Ba, 2015); batch size per task 100; learning rate (all networks) 3e-4; critic activation Tanh; actor activation ReLU; discount factor (γ) 0.99; MLP hidden layer sizes [400, 400, 400, 400]; target network update period 1; tau (τ) 5e-3. |
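The control flow of Algorithm 1 (warm-up data collection, scaling-factor initialization, training with scaled rewards, and periodic reset) can be sketched as follows. This is a minimal illustrative sketch: the environments, the "gradient step", and the `reward_scale` rule standing in for Eq. (6) are all hypothetical placeholders, not the authors' implementation.

```python
import random

def reward_scale(avg_returns):
    # Placeholder for Eq. (6): scale each task inversely to the magnitude of
    # its average return so that tasks contribute on a comparable scale.
    return [1.0 / max(abs(r), 1e-8) for r in avg_returns]

def ars_sketch(num_tasks=3, t_init=5, t_total=20, t_reset=10, seed=0):
    rng = random.Random(seed)
    buffers = [[] for _ in range(num_tasks)]  # replay buffer D_i per task
    theta = rng.random()                      # stand-in for policy params (psi elided)

    # Lines 3-7: warm-up with a random policy for T_init steps per task.
    for _ in range(t_init):
        for i in range(num_tasks):
            buffers[i].append(rng.gauss(i + 1.0, 0.1))  # fake per-task reward

    # Line 8: initialize scaling factors from the warm-up data.
    scales = reward_scale([sum(b) / len(b) for b in buffers])

    resets = 0
    for t in range(t_init + 1, t_total + 1):
        # Lines 10-12: collect data with the current policy.
        for i in range(num_tasks):
            buffers[i].append(rng.gauss(i + 1.0, 0.1))
        # Line 13: update parameters using scaled rewards (real gradient step elided).
        theta += 1e-3 * sum(s * b[-1] for s, b in zip(scales, buffers))
        # Lines 14-17: periodically refresh the scales and reinitialize the networks.
        if t % t_reset == 0:
            scales = reward_scale([sum(b) / len(b) for b in buffers])
            theta = rng.random()
            resets += 1
    return scales, resets

scales, resets = ars_sketch()
print(len(scales), resets)
```

The sketch preserves the structure the pseudocode describes: scaling factors are recomputed only at reset boundaries (every `t_reset` steps), at which point the parameters are also reinitialized.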
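For readability, the Table 8 hyperparameters can be collected into a small config mapping. The values below are from the paper; the dictionary structure and the derived reset interval (total steps divided by the number of resets) are assumptions for illustration only.

```python
# Hypothetical config mirroring Table 8 (MT10 / MT50 values from the paper).
GENERAL_HPARAMS = {
    "training_steps": {"MT10": 2 * 10**7, "MT50": 1 * 10**8},
    "num_resets": {"MT10": 4, "MT50": 6},
    "replay_buffer_size_per_task": {"MT10": 1 * 10**6, "MT50": 5 * 10**5},
    "episode_length": 500,
    "optimizer": "Adam",
    "batch_size_per_task": 100,
    "learning_rate": 3e-4,
    "critic_activation": "Tanh",
    "actor_activation": "ReLU",
    "discount_factor": 0.99,
    "mlp_hidden_sizes": [400, 400, 400, 400],
    "target_update_period": 1,
    "tau": 5e-3,
}

def reset_interval(setup):
    # Assumed relationship: if n_reset resets are spread evenly over training,
    # the interval between resets is total steps // number of resets.
    return (GENERAL_HPARAMS["training_steps"][setup]
            // GENERAL_HPARAMS["num_resets"][setup])

print(reset_interval("MT10"))  # 5000000
```

Note that the paper itself does not state how the reset period relates to `n_reset`; the even-spacing assumption above is only one plausible reading.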