$q$-exponential family for policy optimization

Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide comprehensive experiments on both online and offline problems showing that q-exponential family policies can improve on the Gaussian by a large margin. In particular, we find that the Student's t policy is more stable, performing well across algorithms and problems, shown in Figure 2. We ran experiments with different algorithms to get a better sense of how conclusions about policy parameterization vary across actor-critic algorithms.
Researcher Affiliation | Academia | Lingwei Zhu (University of Tokyo), Haseeb Shah (University of Alberta), Han Wang (University of Alberta), Yukie Nagai (University of Tokyo), Martha White (University of Alberta)
Pseudocode | Yes | Algorithm 1: q-Gaussian sampling. Algorithm 2: out-of-support action handling for the light-tailed q-Gaussian.
Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/qexp.
Open Datasets | Yes | We used the standard MuJoCo benchmark suite from D4RL to evaluate algorithm-policy combinations (Fu et al., 2020). Each D4RL offline dataset contains 1 million samples generated by a partially trained SAC agent.
Dataset Splits | No | The paper describes the composition of the D4RL datasets (Medium-Replay, Medium, Medium-Expert) and how many samples each contains, but does not specify explicit train/validation/test splits beyond using these named datasets as distinct experimental settings. For online experiments, it details evaluation procedures (e.g., averaging over 3 episodes or 1 episode) rather than dataset splits.
Hardware Specification | No | The paper does not report hardware details such as the GPU models, CPU types, or memory used to run the experiments.
Software Dependencies | No | The paper mentions PyTorch (Paszke et al., 2019) but does not specify a version number for PyTorch or any other software component used in the experiments.
Experiment Setup | Yes | D.2 Online experiments: a 2-layer network with 64 nodes per layer and ReLU nonlinearities; batch size 32; critic target network updated with Polyak averaging (α = 0.01); Table 4 lists default hyperparameters and sweep choices. D.3 Offline experiments: a 2-layer network with 256 nodes per layer; batch size 256; critic target network updated with Polyak averaging (α = 0.005); discount rate 0.99; Table 5 lists default hyperparameters and sweep choices.
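The pseudocode row above mentions Algorithm 1, q-Gaussian sampling. The paper's exact algorithm is not reproduced here, but one standard method is the generalized Box-Muller transform of Thistleton et al. (2007), which draws a q-Gaussian via the Tsallis q-logarithm; for 1 < q < 3 the result is a scaled Student's t variate, the heavy-tailed regime the report highlights. A minimal sketch (the function names `log_q` and `sample_q_gaussian` are ours, not from the paper's code):

```python
import math
import random

def log_q(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x**(1-q) - 1) / (1 - q), with ln_1 = ln."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def sample_q_gaussian(q, mu=0.0, sigma=1.0, rng=random):
    """One draw from a q-Gaussian via the generalized Box-Muller method.

    Valid for q < 3. q = 1 recovers the ordinary Gaussian; 1 < q < 3 gives
    the heavy-tailed (Student's t-like) regime, q < 1 the light-tailed one.
    """
    # The transform uses the deformed parameter q' = (1 + q) / (3 - q),
    # per Thistleton, Marsh, Nelson & Tsallis (2007).
    q_prime = (1.0 + q) / (3.0 - q)
    u1, u2 = rng.random(), rng.random()
    z = math.sqrt(-2.0 * log_q(u1, q_prime)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z
```

At q = 1 the q-logarithm reduces to the natural logarithm, so the sketch degenerates to the classical Box-Muller draw from a standard normal.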
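The Polyak averaging mentioned in the experiment setup (α = 0.01 online, α = 0.005 offline) is the standard soft target-network update theta_target <- (1 - alpha) * theta_target + alpha * theta_online. A framework-free sketch, assuming parameters flattened into plain lists (the function name `polyak_update` is ours):

```python
def polyak_update(target, online, alpha):
    """In-place soft update: target <- (1 - alpha) * target + alpha * online.

    Small alpha (e.g. 0.01 or 0.005) makes the target network a slowly
    moving average of the online network, stabilizing critic bootstrapping.
    """
    for i, (t, o) in enumerate(zip(target, online)):
        target[i] = (1.0 - alpha) * t + alpha * o
    return target
```

Repeated updates make the target weights converge geometrically toward the online weights at rate (1 - alpha) per step.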