Fat-to-Thin Policy Optimization: Offline Reinforcement Learning with Sparse Policies
Authors: Lingwei Zhu, Han Wang, Yukie Nagai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we first verify that FtTPO is capable of learning a sparse and safe policy on a simulated medicine environment. Then on the D4RL MuJoCo benchmark, we demonstrate that FtTPO can perform favorably against several popular offline algorithms that by default employ the Gaussian policy. Lastly, we examine in the ablation studies that FtTPO improves on its components. |
| Researcher Affiliation | Academia | Lingwei Zhu (University of Tokyo), Han Wang (University of Alberta), Yukie Nagai (University of Tokyo) |
| Pseudocode | Yes | Algorithm 1 (q-Gaussian Initialization): Input: q_f > 1 and q_s < 1. Initialize π_ϕ by N_{q_f}(µ_ϕ, Σ_ϕ) per Eq. (5); initialize π_θ by N_{q_s}(µ_θ, Σ_θ); return π_ϕ, π_θ. Algorithm 2 (q-Gaussian Sampling): Input: q, N, µ, Σ. Sample u_1, u_2 ~ Uniform(0, 1)^N; compute z = √(−2 ln_q(u_1)) cos(2πu_2); return µ + Σ^{1/2} z. Algorithm 3 (Fat-to-Thin Policy Optimization): Input: D, T, τ > 0, q_w < 1. Initialize policies by Alg. 1. While t < T: sample states s from dataset D; sample actions a from behavior policy π_D; compute Q_{ψ_t}(s, a) and V_{ζ_t}(s); update ϕ_t to ϕ_{t+1} by minimizing −Ê_{s,a}[ exp_{q_w}((Q_{ψ_t}(s, a) − V_{ζ_t}(s)) / τ) ln π_{ϕ_t}(a\|s) ]; sample b from π_{θ_t} by Alg. 2; copy µ_{ϕ_{t+1}} to µ_{θ_t}; update θ_t to θ_{t+1} by minimizing Ê_{s,b}[ π_{ϕ_t}(b\|s)/π_{θ_t}(b\|s) − 1 − ln (π_{ϕ_t}(b\|s)/π_{θ_t}(b\|s)) ]. |
| Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/fat2thin. |
| Open Datasets | Yes | Then on the D4RL MuJoCo benchmark, we demonstrate that FtTPO can perform favorably against several popular offline algorithms that by default employ the Gaussian policy. ... We followed (Li et al., 2023) on reproducing this environment; see their Appendix D.1 for detail. |
| Dataset Splits | No | The paper mentions "The offline dataset contains 50 trajectories each comprising 24 steps." for the safety-critical treatment simulation, and uses the "D4RL MuJoCo suite" as a standard benchmark. While these indicate the datasets used, the paper does not explicitly specify the training/validation/test splits used for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only states computation time in hours. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We provide parameter settings of D4RL experiments in Table 1 and the synthetic environment in Table 2. The environment-specific best hyperparameters are listed in Tables 3 and 4, respectively. ... Learning rate: FtTPO swept in {1e-3, 3e-4}; baselines swept in {3e-3, 1e-3, 3e-4, 1e-4}. Weights: FtTPO swept in {1.0, 0.5, 0.01}; baselines use the values reported in each algorithm's publication, except for TAWAC + medium datasets, where the value was swept in {1.0, 0.5, 0.01}. Discount rate: 0.99. Timeout: 1000. Training iterations: 1,000,000. Value network: 2 hidden layers of size 256. Policy network: 2 hidden layers of size 256. Minibatch size: 256. Adam β1 = 0.9, β2 = 0.99. Target network synchronization: Polyak averaging with α = 0.005. Seeds: 5 for sweeping, 10 for the best setting. STD of the sparse policy: clipped at the upper bound of the action space. |
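The q-Gaussian sampling step (Algorithm 2) is a generalized Box-Muller transform in which the ordinary logarithm is replaced by the Tsallis q-logarithm, ln_q(x) = (x^(1−q) − 1)/(1 − q). A minimal NumPy sketch of that step is below; the function names and the scalar `sigma_sqrt` argument (standing in for Σ^{1/2}) are illustrative choices, not taken from the paper's released code.

```python
import numpy as np

def log_q(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q).

    Recovers the natural log in the limit q -> 1.
    """
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def sample_q_gaussian(q, n, mu, sigma_sqrt, rng=None):
    """Generalized Box-Muller sampling (sketch of Algorithm 2).

    Draws n samples from a q-Gaussian with location mu and scale
    sigma_sqrt (playing the role of Sigma^{1/2} in the 1-D case).
    For u1 in (0, 1), ln_q(u1) < 0 for any q, so the square root
    below is well defined.
    """
    rng = np.random.default_rng() if rng is None else rng
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    z = np.sqrt(-2.0 * log_q(u1, q)) * np.cos(2.0 * np.pi * u2)
    return mu + sigma_sqrt * z
```

With q = 1 this reduces to the classical Box-Muller sampler for a standard Gaussian; q < 1 yields the compact-support ("thin") q-Gaussians used for the learned policy π_θ, while q > 1 yields the heavy-tailed ("fat") proposal π_ϕ.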