$q$-exponential family for policy optimization
Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide comprehensive experiments on both online and offline problems showing that q-exponential family policies can improve on the Gaussian by a large margin. In particular, we find that the Student's t policy is more stable, performing well across algorithms and problems, shown in Figure 2. We ran experiments with different algorithms, to get a better sense of how conclusions about policy parameterization vary across different actor-critic algorithms. |
| Researcher Affiliation | Academia | Lingwei Zhu (University of Tokyo), Haseeb Shah (University of Alberta), Han Wang (University of Alberta), Yukie Nagai (University of Tokyo), Martha White (University of Alberta) |
| Pseudocode | Yes | Algorithm 1: q-Gaussian sampling Algorithm 2: Out-of-support action handling for the light-tailed q-Gaussian |
| Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/qexp. |
| Open Datasets | Yes | We used the standard benchmark MuJoCo suite from D4RL to evaluate algorithm-policy combinations (Fu et al., 2020). The D4RL offline datasets all contain 1 million samples generated by a partially trained SAC agent. |
| Dataset Splits | No | The paper describes the composition of the D4RL datasets (Medium-Replay, Medium, Medium-Expert) and how many samples they contain, but does not specify explicit train/test/validation splits for their own experiments beyond using these named datasets as distinct experimental settings. For online experiments, it details evaluation procedures (e.g., averaging over 3 or 1 episode) rather than dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not specify a version number for PyTorch or any other software component used in the experiments. |
| Experiment Setup | Yes | D.2 ONLINE EXPERIMENTS: We used a 2-layer network with 64 nodes on each layer and ReLU non-linearities. The batch size was 32. Agents used a target network for the critic, updated with Polyak averaging with α = 0.01. Table 4: Default hyperparameters and sweeping choices for online experiments. D.3 OFFLINE EXPERIMENTS: We used a 2-layer network with 256 nodes on each layer. The batch size was 256. Agents used a target network for the critic, updated with Polyak averaging with α = 0.005. The discount rate was set to 0.99. Table 5: Default hyperparameters and sweeping choices for offline experiments. |
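The table's Pseudocode row mentions an Algorithm 1 for q-Gaussian sampling. The paper's exact procedure is not reproduced here, but one standard way to draw q-Gaussian deviates is the generalized Box–Muller method of Thistleton et al. (2007); the sketch below implements that method as an illustration, not the paper's Algorithm 1. The function names `log_q` and `sample_q_gaussian` are our own.

```python
import numpy as np

def log_q(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q), with ln_1 = ln."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def sample_q_gaussian(q, size=1, rng=None):
    """Generalized Box-Muller sampler for a standard q-Gaussian, 1 <= q < 3.

    Draws U1, U2 ~ Uniform(0, 1) and returns
        sqrt(-2 ln_{q'}(U1)) * cos(2 pi U2),  where q' = (1 + q) / (3 - q).
    For q -> 1 this reduces to the classical Box-Muller Gaussian sampler.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_prime = (1.0 + q) / (3.0 - q)
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    return np.sqrt(-2.0 * log_q(u1, q_prime)) * np.cos(2.0 * np.pi * u2)
```

For q > 1 the resulting density is heavy-tailed (Student's t-like), while q < 1 gives the compactly supported, light-tailed q-Gaussian whose out-of-support actions the paper's Algorithm 2 addresses.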
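The experiment-setup row states that target critics are updated with Polyak averaging (α = 0.01 online, α = 0.005 offline). A minimal sketch of that soft update, using plain NumPy arrays to stand in for network parameter tensors (the function name `polyak_update` is ours, not from the paper's code):

```python
import numpy as np

def polyak_update(target_params, online_params, alpha):
    """Soft target-network update: theta_tgt <- (1 - alpha) * theta_tgt + alpha * theta.

    Small alpha (e.g. 0.01 online, 0.005 offline) keeps the target network a
    slowly moving average of the online critic, stabilizing the TD targets.
    """
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(target_params, online_params)]

# Illustration: one update step with alpha = 0.01 moves the target 1% of the
# way toward the online parameters.
target = [np.zeros(4)]
online = [np.ones(4)]
target = polyak_update(target, online, alpha=0.01)
```

In a PyTorch training loop the same arithmetic would be applied in place to each pair of `parameters()` tensors after every critic gradient step.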