Normality-Guided Distributional Reinforcement Learning for Continuous Control

Authors: Ju-Seung Byun, Andrew Perrault

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We provide an empirical validation that uses PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a) on several continuous control tasks. We compare our methods to standard algorithms (PPO and TRPO) as well as the ensemble-based approach. We find that our method exhibits better performance than the ensemble-based method in 10/16 tested environments, while using half as many weights and training twice as fast.
Researcher Affiliation Academia Ju-Seung Byun (EMAIL), Department of Computer Science and Engineering, The Ohio State University; Andrew Perrault (EMAIL), Department of Computer Science and Engineering, The Ohio State University
Pseudocode Yes Algorithm 1 MC-CLT with Uncertainty Weight
1: Input: policy π, distributional value function V^{D,π}_θ, variance network σ²_ψ, rollout buffer B
2: Initialize π, V^{D,π}, and σ²_ψ
3: for i = 1 to epoch_num do
4:   for j = 1 to rollout_num do
5:     a_t ∼ π(·|s_t)
6:     s_{t+1} ∼ P(·|s_t, a_t)
7:     Compute N(q_avg(s), σ²_avg) with V^{D,π}(s_t) = {q_0(s_t), q_1(s_t), ..., q_{N−1}(s_t)}
8:     Find q*(s_t) from N(q_avg(s), σ²_avg)
9:     Compute the mean squared error E = Σ_{i=0}^{N−1} (q_i(s_t) − q*_i(s_t))²
10:    Store (s_t, a_t, r(s_t, a_t), E, V^{D,π}(s_t), σ²_ψ(s_t)) in B
11:    if s_{t+1} is terminal then
12:      Reset env
13:    end if
14:  end for
15:  Perform the parametric search to find the temperature T (Appendix B)
16:  Compute target quantiles for V^{D,π} with the stored return and σ²_ψ
17:  Minimize L_{σ²}(ψ) and L_q(θ)
18:  Optimize π with a policy objective scaled by w
19: end for
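Lines 7–9 of Algorithm 1 fit a normal distribution to the predicted quantiles and measure how far the quantiles deviate from it. A minimal sketch of that step is below; the function name and the choice of quantile midpoints τ_i = (2i + 1)/(2N) are assumptions for illustration, not the authors' exact formulation:

```python
from statistics import NormalDist
import numpy as np

def mc_clt_targets(quantiles):
    """Fit N(q_avg, sigma_avg^2) to the N predicted quantiles of V^{D,pi}(s),
    then read target quantiles q*_i off its inverse CDF at the assumed bin
    midpoints tau_i = (2i + 1) / (2N). Returns the targets and the squared
    error E from line 9 of Algorithm 1. (Illustrative, not the authors' code.)
    """
    q = np.asarray(quantiles, dtype=float)
    n = len(q)
    dist = NormalDist(mu=q.mean(), sigma=max(q.std(), 1e-8))
    taus = (2 * np.arange(n) + 1) / (2 * n)
    q_star = np.array([dist.inv_cdf(t) for t in taus])
    err = float(np.sum((q - q_star) ** 2))  # E on line 9
    return q_star, err
```

A large E signals that the predicted quantiles disagree with the normal approximation, which Algorithm 1 stores in the rollout buffer as an uncertainty signal for weighting the policy objective.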
Open Source Code Yes We provide the hyperparameters used in our evaluations and all source code is available at https://github.com/shashacks/MC_CLT.
Open Datasets Yes We evaluate our method on continuous OpenAI Gym Box2D (Brockman et al., 2016) and MuJoCo tasks (Todorov et al., 2012), as these environments have continuous action spaces and dense reward functions to use the normal approximation of MC-CLT (Theorem 1).
Dataset Splits No The paper uses continuous OpenAI Gym Box2D and MuJoCo tasks, which are reinforcement learning environments rather than static datasets with explicit training/test/validation splits. It mentions running experiments for 100 episodes and 30 different settings, but not specific data partitioning.
Hardware Specification No The authors would like to thank the Ohio Supercomputer Center (Center, 1987) for providing the computational resources used in this research. This statement indicates the use of computational resources but does not specify any particular hardware models (e.g., GPU/CPU models, memory details).
Software Dependencies No Our implementation is based on Spinning Up (Achiam, 2018), an open-source resource by OpenAI that provides implementations of several reinforcement learning algorithms and tools to help researchers get started with deep RL. For our experiments, we chose two representative deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a). All networks are updated with the Adam optimizer (Kingma & Ba, 2014). The paper mentions several software components (Spinning Up, PPO, TRPO, Adam optimizer) and references their creators, but it does not provide specific version numbers for any of them.
Experiment Setup Yes We use the default hyperparameters such as learning rate and batch size. All policies have a two-layer tanh network with 64 x 32 units, and all value functions and distributional value functions have a two-layer ReLU network with 64 x 64 units or 128 x 128 units for all environments. We use 8 quantile bars for all experiments, as we empirically find that using 8 or more quantile bars provides similar satisfactory performance. w_tar is chosen from {0.85, 0.9}, and w_min is chosen from {0.4, 0.5, 0.6}.
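The stated architectures (two hidden tanh layers of 64 x 32 units for the policy, two hidden ReLU layers of 64 x 64 units for the distributional value function with 8 quantile outputs) can be sketched with a plain NumPy forward pass; the observation/action dimensions below are assumed for illustration and are not from the paper:

```python
import numpy as np

def init_mlp(sizes, rng):
    """Initialize (weight, bias) pairs for a fully connected network."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, hidden_act):
    """Apply hidden_act after every layer except the final linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = hidden_act(x)
    return x

rng = np.random.default_rng(0)
obs_dim, act_dim, n_quantiles = 11, 3, 8  # assumed dims, e.g. a Hopper-like task

policy = init_mlp([obs_dim, 64, 32, act_dim], rng)          # two-layer tanh, 64 x 32
value_dist = init_mlp([obs_dim, 64, 64, n_quantiles], rng)  # two-layer ReLU, 64 x 64

obs = rng.standard_normal(obs_dim)
action_mean = forward(policy, obs, np.tanh)
quantile_preds = forward(value_dist, obs, lambda z: np.maximum(z, 0.0))
```

The distributional value head simply widens the usual scalar value output to 8 units, one per quantile bar, which is why the paper can reuse the Spinning Up training loop largely unchanged.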