Truncated Gaussian Policy for Debiased Continuous Control

Authors: Ganghun Lee, Minji Kim, Minsu Lee, Byoung-Tak Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical studies and comparisons on various continuous control tasks demonstrate that the truncated Gaussian policies significantly reduce the rate of boundary-action usage, while the scale-adjusted ones successfully balance the bias and counter-bias. The approach generally outperforms the Gaussian policy and shows competitive results compared to other approaches designed to counteract the bias.
Researcher Affiliation | Academia | 1 Interdisciplinary Program in Artificial Intelligence, Seoul National University; 2 Department of Computer Science, Seoul National University; 3 AIIS, Seoul National University; 4 School of AI Convergence, Sungshin Women's University
Pseudocode | No | The paper describes mathematical formulations and textual explanations of the methods but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on eight continuous control tasks from MuJoCo (Todorov, Erez, and Tassa 2012), including six locomotion tasks (HalfCheetah, Walker2d, Ant, Hopper, Humanoid, and Swimmer) and two manipulation tasks (Pusher and Reacher), with version v4. For high-dimensional action space experiments, we use two locomotion tasks from HumanoidBench (h1hand-walk, h1hand-reach) (61 dimensions) (Sferrazza et al. 2024) and two from the DeepMind Control Suite (DMC) (dog-walk, dog-run) (Tunyasuvunakool et al. 2020) (38 dimensions).
Dataset Splits | No | The paper states: "Each MuJoCo result is averaged over 10 seeds, and HumanoidBench and DeepMind results are averaged over 5 seeds." This describes the aggregation of results over random seeds but does not specify training/validation/test splits. The environments (MuJoCo, HumanoidBench, DMC) imply their own standard setups, but explicit split details are not provided.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU or CPU models) used for running the experiments.
Software Dependencies | No | The paper mentions: "We refer to CleanRL (Huang et al. 2022) for the standard PPO setup." However, it does not specify version numbers for CleanRL or any other software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We use k = 2 and d_min = 0.01 for our SA-TGaussian. The initial scale for all methods is set to σ_init = 0.5, a learnable parameter independent of states, as in the standard PPO setup. The entropy-regularized Gaussian baseline (GEntmax) uses an entropy coefficient of 0.01.
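The core idea behind the paper's truncated Gaussian policy can be sketched in a few lines: instead of clipping or squashing a Gaussian sample into the bounded action range (which piles probability mass onto the boundary actions), the density is truncated and renormalized inside the bounds. The sketch below is a minimal illustration using SciPy's `truncnorm`, not the authors' implementation; the function name and the choice of σ = 0.5 (matching the paper's σ_init) are assumptions for demonstration.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_gaussian(mu, sigma, low=-1.0, high=1.0, rng=None):
    """Sample an action from a Gaussian truncated to [low, high].

    Hypothetical helper, not the paper's code. Truncation renormalizes
    the density inside the bounds, so no probability mass accumulates
    at the boundary actions (unlike clipping a plain Gaussian).
    """
    # truncnorm parameterizes the bounds in standardized units.
    a = (low - mu) / sigma
    b = (high - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)

rng = np.random.default_rng(0)
# Mean near the boundary, sigma = 0.5 as in the paper's initial scale:
# a clipped Gaussian would emit many exact boundary actions here,
# while the truncated one never does.
actions = [sample_truncated_gaussian(0.9, 0.5, rng=rng) for _ in range(1000)]
assert all(-1.0 <= x <= 1.0 for x in actions)
```

A clipped-Gaussian policy with the same μ and σ would return the exact value 1.0 for a large fraction of these samples, which is the boundary-action bias the paper measures and mitigates.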