Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning
Authors: Hyun Kyu Lee, Sung Whan Yoon
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively simulate various RL environments, confirming the consistent benefits of flatter reward landscapes in enhancing the robustness of RL under diverse conditions, including action selection, transition dynamics, and reward functions. |
| Researcher Affiliation | Academia | Hyun Kyu Lee¹, Sung Whan Yoon¹,² — ¹Graduate School of Artificial Intelligence and ²Department of Electrical Engineering, Ulsan National Institute of Science and Technology, EMAIL |
| Pseudocode | Yes | Algorithm 1 SAM Integrated with PPO |
| Open Source Code | Yes | The code for these experiments is available at https://github.com/HK-05/flatreward-RRL. |
| Open Datasets | Yes | To validate our claims, we conduct extensive experiments in various MuJoCo environments (Todorov et al., 2012), including Hopper, Walker2d, and HalfCheetah, by varying actions, transition probabilities, and rewards. [...] To validate the applicability and reliability of our SAM-enhanced method in a broader context, we extended our experiments to discrete action environments provided by OpenAI Gym: CartPole-v1 and LunarLander-v2. |
| Dataset Splits | No | Each experiment was conducted over five independent trials, each initialized with a different random seed to ensure statistical significance. Furthermore, for each evaluation, we performed 100 evaluation runs and averaged the results to enhance the stability and accuracy of our findings. |
| Hardware Specification | Yes | Table 9: Computational costs comparison to achieve convergence (done with NVIDIA RTX 3090) |
| Software Dependencies | No | Both the actor and critic learning rates are set to 3 × 10⁻⁴, with the Adam optimizer used for optimization. |
| Experiment Setup | Yes | For all agents, including PPO, SAM+PPO, and RNAC, we employ a multi-layer perceptron (MLP) architecture for both the actor (policy network) and the critic (value network). The network consists of an input layer matching the state dimension of the environment, followed by three fully connected hidden layers, each with 64 neurons and Tanh activation functions. [...] The shared hyperparameters are as follows: the discount factor γ is set to 0.99, the GAE parameter λ is 0.95, and the PPO clip parameter ϵ is 0.2. Both the actor and critic learning rates are set to 3 × 10⁻⁴, with the Adam optimizer used for optimization. The batch size is 2048, and the mini-batch size is 64, with 10 PPO epochs per update (K_epochs = 10). |
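The "SAM Integrated with PPO" pseudocode (Algorithm 1) builds on the standard two-step sharpness-aware minimization update: perturb the parameters toward the ascent direction within a radius ρ, then apply the gradient taken at that perturbed point. A minimal numpy sketch of one SAM step on a toy loss — the paper applies this to the PPO surrogate objective, and the `rho` value and quadratic loss here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sam_step(params, grad_fn, lr=3e-4, rho=0.05):
    """One sharpness-aware minimization (SAM) update:
    1) perturb params along the normalized ascent direction, scaled to radius rho;
    2) evaluate the gradient at the perturbed (locally worst-case) point;
    3) apply that gradient to the ORIGINAL params.
    """
    g = grad_fn(params)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad_fn(params + eps)              # gradient at the perturbed point
    return params - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
for _ in range(1000):
    w = sam_step(w, lambda x: x, lr=0.1)
```

In the PPO integration, `grad_fn` would be the gradient of the clipped surrogate loss over the current mini-batch, so each of the K_epochs updates costs two backward passes instead of one — consistent with the report's computational-cost comparison (Table 9).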
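The experiment-setup row fully specifies the network shape (three hidden layers of 64 units with Tanh) and the shared hyperparameters. A small numpy sketch of that actor/critic architecture — the state and action dimensions below are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Shared hyperparameters quoted in the setup row.
GAMMA, GAE_LAMBDA, CLIP_EPS = 0.99, 0.95, 0.2
LR, BATCH, MINIBATCH, K_EPOCHS = 3e-4, 2048, 64, 10

def make_mlp(in_dim, out_dim, hidden=64, n_hidden=3, seed=0):
    """Build weight/bias pairs for an MLP: input -> 3 x 64 hidden -> output."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Forward pass with Tanh on hidden layers and a linear output layer."""
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    return x @ W + b  # policy mean (actor) or state value (critic)

# Illustrative dimensions (e.g., a Hopper-like task): state dim 11, action dim 3.
actor = make_mlp(11, 3)
critic = make_mlp(11, 1)
```

Both networks share this topology; only the output dimension differs (action dimension for the actor, a scalar value for the critic), and each is trained with Adam at the quoted learning rate.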