Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning
Authors: Hyun Kyu Lee, Sung Whan Yoon
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively simulate various RL environments, confirming the consistent benefits of flatter reward landscapes in enhancing the robustness of RL under diverse conditions, including action selection, transition dynamics, and reward functions. |
| Researcher Affiliation | Academia | Hyun Kyu Lee¹, Sung Whan Yoon¹,² — ¹Graduate School of Artificial Intelligence and ²Department of Electrical Engineering, Ulsan National Institute of Science and Technology, EMAIL |
| Pseudocode | Yes | Algorithm 1 SAM Integrated with PPO |
| Open Source Code | Yes | The code for these experiments is available at https://github.com/HK-05/flatreward-RRL. |
| Open Datasets | Yes | To validate our claims, we conduct extensive experiments in various MuJoCo environments (Todorov et al., 2012), including Hopper, Walker2d, and HalfCheetah, by varying actions, transition probabilities, and rewards. [...] To validate the applicability and reliability of our SAM-enhanced method in a broader context, we extended our experiments to discrete action environments provided by OpenAI Gym: CartPole-v1 and LunarLander-v2. |
| Dataset Splits | No | Each experiment was conducted over five independent trials, each initialized with a different random seed to ensure statistical significance. Furthermore, for each evaluation, we performed 100 evaluation runs and averaged the results to enhance the stability and accuracy of our findings. |
| Hardware Specification | Yes | Table 9: Computational costs comparison to achieve convergence (done with NVIDIA RTX 3090) |
| Software Dependencies | No | Both the actor and critic learning rates are set to 3 × 10⁻⁴, with the Adam optimizer used for optimization. |
| Experiment Setup | Yes | For all agents, including PPO, SAM+PPO, and RNAC, we employ a multi-layer perceptron (MLP) architecture for both the actor (policy network) and the critic (value network). The network consists of an input layer matching the state dimension of the environment, followed by three fully connected hidden layers, each with 64 neurons and Tanh activation functions. [...] The shared hyperparameters are as follows: the discount factor γ is set to 0.99, the GAE parameter λ is 0.95, and the PPO clip parameter ϵ is 0.2. Both the actor and critic learning rates are set to 3 × 10⁻⁴, with the Adam optimizer used for optimization. The batch size is 2048, and the mini-batch size is 64, with 10 PPO epochs per update (K_epochs = 10). |
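The "SAM Integrated with PPO" pseudocode (Algorithm 1) builds on the standard two-step sharpness-aware minimization update: perturb the parameters toward the ascent direction within a radius ρ, then apply the gradient taken at that perturbed point. A minimal numpy sketch of one SAM step on a toy loss — the paper applies this to the PPO surrogate objective, and the `rho` value and quadratic loss here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sam_step(params, grad_fn, lr=3e-4, rho=0.05):
    """One sharpness-aware minimization (SAM) update:
    1) perturb params along the normalized ascent direction, scaled to radius rho;
    2) evaluate the gradient at the perturbed (locally worst-case) point;
    3) apply that gradient to the ORIGINAL params.
    """
    g = grad_fn(params)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad_fn(params + eps)              # gradient at the perturbed point
    return params - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
for _ in range(1000):
    w = sam_step(w, lambda x: x, lr=0.1)
```

In the PPO integration, `grad_fn` would be the gradient of the clipped surrogate loss over the current mini-batch, so each of the K_epochs updates costs two backward passes instead of one — consistent with the report's computational-cost comparison (Table 9).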
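The experiment-setup row fully specifies the network shape (three hidden layers of 64 units with Tanh) and the shared hyperparameters. A small numpy sketch of that actor/critic architecture — the state and action dimensions below are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Shared hyperparameters quoted in the setup row.
GAMMA, GAE_LAMBDA, CLIP_EPS = 0.99, 0.95, 0.2
LR, BATCH, MINIBATCH, K_EPOCHS = 3e-4, 2048, 64, 10

def make_mlp(in_dim, out_dim, hidden=64, n_hidden=3, seed=0):
    """Build weight/bias pairs for an MLP: input -> 3 x 64 hidden -> output."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Forward pass with Tanh on hidden layers and a linear output layer."""
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    return x @ W + b  # policy mean (actor) or state value (critic)

# Illustrative dimensions (e.g., a Hopper-like task): state dim 11, action dim 3.
actor = make_mlp(11, 3)
critic = make_mlp(11, 1)
```

Both networks share this topology; only the output dimension differs (action dimension for the actor, a scalar value for the critic), and each is trained with Adam at the quoted learning rate.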