Fat-to-Thin Policy Optimization: Offline Reinforcement Learning with Sparse Policies
Authors: Lingwei Zhu, Han Wang, Yukie Nagai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we first verify that FtTPO is capable of learning a sparse and safe policy on a simulated medicine environment. Then on the D4RL MuJoCo benchmark, we demonstrate that FtTPO can perform favorably against several popular offline algorithms that by default employ the Gaussian policy. Lastly, we examine in the ablation studies that FtTPO improves on its components. |
| Researcher Affiliation | Academia | Lingwei Zhu (University of Tokyo), Han Wang (University of Alberta), Yukie Nagai (University of Tokyo) |
| Pseudocode | Yes | Algorithm 1 (q-Gaussian Initialization): Input: q_f > 1 and q_s < 1. Initialize π_ϕ by N_{q_f}(µ_ϕ, Σ_ϕ) per Eq. (5); initialize π_θ by N_{q_s}(µ_θ, Σ_θ); return π_ϕ, π_θ. Algorithm 2 (q-Gaussian Sampling): Input: q, N, µ, Σ. Sample u_1, u_2 ~ Uniform(0, 1)^N; compute z = √(−2 ln_q(u_1)) cos(2πu_2); return µ + Σ^{1/2} z. Algorithm 3 (Fat-to-Thin Policy Optimization): Input: D, T, τ > 0, q_w < 1. Initialize policies by Alg. 1. While t < T: sample states s from dataset D; sample actions a from behavior policy π_D; compute Q_{ψ_t}(s, a) and V_{ζ_t}(s); update ϕ_t to ϕ_{t+1} by minimizing −Ê_{s,a}[ exp_{q_w}((Q_{ψ_t}(s, a) − V_{ζ_t}(s)) / τ) ln π_{ϕ_t}(a\|s) ]; sample b from π_{θ_t} by Alg. 2; copy µ_{ϕ_{t+1}} to µ_{θ_t}; update θ_t to θ_{t+1} by minimizing Ê_{s,b}[ π_{ϕ_t}(b\|s)/π_{θ_t}(b\|s) − 1 − ln (π_{ϕ_t}(b\|s)/π_{θ_t}(b\|s)) ]. |
| Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/fat2thin. |
| Open Datasets | Yes | Then on the D4RL MuJoCo benchmark, we demonstrate that FtTPO can perform favorably against several popular offline algorithms that by default employ the Gaussian policy. ... We followed (Li et al., 2023) on reproducing this environment; see their Appendix D.1 for detail. |
| Dataset Splits | No | The paper mentions "The offline dataset contains 50 trajectories each comprising 24 steps." for the safety-critical treatment simulation, and uses the "D4RL MuJoCo suite" as a standard benchmark. While these indicate the datasets used, the paper does not explicitly specify the training/validation/test splits used for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only states computation time in hours. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We provide parameter settings of D4RL experiments in Table 1 and the synthetic environment in Table 2. The environment-specific best hyperparameters are listed in Tables 3 and 4, respectively. ... Learning rate: FtTPO swept in {1e-3, 3e-4}; baselines swept in {3e-3, 1e-3, 3e-4, 1e-4}. Weights: FtTPO swept in {1.0, 0.5, 0.01}; baselines use the values reported in each algorithm's publication, except for TAWAC + medium datasets, where the value was swept in {1.0, 0.5, 0.01}. Discount rate: 0.99. Timeout: 1000. Training iterations: 1,000,000. Value network: 2 hidden layers of size 256. Policy network: 2 hidden layers of size 256. Minibatch size: 256. Adam β1 = 0.9, β2 = 0.99. Target network synchronization: Polyak averaging with α = 0.005. Seeds: 5 for sweeping, 10 for the best setting. STD of the sparse policy: clipped at the upper bound of the action space. |
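The q-Gaussian sampling step (Algorithm 2) is a generalized Box-Muller transform in which the ordinary logarithm is replaced by the Tsallis q-logarithm, ln_q(x) = (x^(1−q) − 1)/(1 − q). A minimal NumPy sketch of that step is below; the function names and the scalar `sigma_sqrt` argument (standing in for Σ^{1/2}) are illustrative choices, not taken from the paper's released code.

```python
import numpy as np

def log_q(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q).

    Recovers the natural log in the limit q -> 1.
    """
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def sample_q_gaussian(q, n, mu, sigma_sqrt, rng=None):
    """Generalized Box-Muller sampling (sketch of Algorithm 2).

    Draws n samples from a q-Gaussian with location mu and scale
    sigma_sqrt (playing the role of Sigma^{1/2} in the 1-D case).
    For u1 in (0, 1), ln_q(u1) < 0 for any q, so the square root
    below is well defined.
    """
    rng = np.random.default_rng() if rng is None else rng
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    z = np.sqrt(-2.0 * log_q(u1, q)) * np.cos(2.0 * np.pi * u2)
    return mu + sigma_sqrt * z
```

With q = 1 this reduces to the classical Box-Muller sampler for a standard Gaussian; q < 1 yields the compact-support ("thin") q-Gaussians used for the learned policy π_θ, while q > 1 yields the heavy-tailed ("fat") proposal π_ϕ.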