$q$-exponential family for policy optimization
Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide comprehensive experiments on both online and offline problems showing that q-exponential family policies can improve on the Gaussian by a large margin. In particular, we find that the Student's t policy is more stable, performing well across algorithms and problems, shown in Figure 2. We ran experiments with different algorithms, to get a better sense of how conclusions about policy parameterization vary across different actor-critic algorithms. |
| Researcher Affiliation | Academia | Lingwei Zhu (University of Tokyo), Haseeb Shah (University of Alberta), Han Wang (University of Alberta), Yukie Nagai (University of Tokyo), Martha White (University of Alberta) |
| Pseudocode | Yes | Algorithm 1: q-Gaussian sampling Algorithm 2: Out-of-support action handling for the light-tailed q-Gaussian |
| Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/qexp. |
| Open Datasets | Yes | We used the standard benchmark MuJoCo suite from D4RL to evaluate algorithm-policy combinations (Fu et al., 2020). The D4RL offline datasets all contain 1 million samples generated by a partially trained SAC agent. |
| Dataset Splits | No | The paper describes the composition of the D4RL datasets (Medium-Replay, Medium, Medium-Expert) and how many samples they contain, but does not specify explicit train/test/validation splits for their own experiments beyond using these named datasets as distinct experimental settings. For online experiments, it details evaluation procedures (e.g., averaging over 3 or 1 episode) rather than dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not specify a version number for PyTorch or any other software component used in the experiments. |
| Experiment Setup | Yes | D.2 ONLINE EXPERIMENTS: We used a 2-layer network with 64 nodes on each layer and ReLU non-linearities. The batch size was 32. Agents used a target network for the critic, updated with Polyak averaging with α = 0.01. Table 4: Default hyperparameters and sweeping choices for online experiments. D.3 OFFLINE EXPERIMENTS: We used a 2-layer network with 256 nodes on each layer. The batch size was 256. Agents used a target network for the critic, updated with Polyak averaging with α = 0.005. The discount rate was set to 0.99. Table 5: Default hyperparameters and sweeping choices for offline experiments. |
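The table's Pseudocode row mentions an Algorithm 1 for q-Gaussian sampling. The paper's exact procedure is not reproduced here, but one standard way to draw q-Gaussian deviates is the generalized Box–Muller method of Thistleton et al. (2007); the sketch below implements that method as an illustration, not the paper's Algorithm 1. The function names `log_q` and `sample_q_gaussian` are our own.

```python
import numpy as np

def log_q(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q), with ln_1 = ln."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def sample_q_gaussian(q, size=1, rng=None):
    """Generalized Box-Muller sampler for a standard q-Gaussian, 1 <= q < 3.

    Draws U1, U2 ~ Uniform(0, 1) and returns
        sqrt(-2 ln_{q'}(U1)) * cos(2 pi U2),  where q' = (1 + q) / (3 - q).
    For q -> 1 this reduces to the classical Box-Muller Gaussian sampler.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_prime = (1.0 + q) / (3.0 - q)
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    return np.sqrt(-2.0 * log_q(u1, q_prime)) * np.cos(2.0 * np.pi * u2)
```

For q > 1 the resulting density is heavy-tailed (Student's t-like), while q < 1 gives the compactly supported, light-tailed q-Gaussian whose out-of-support actions the paper's Algorithm 2 addresses.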
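The experiment-setup row states that target critics are updated with Polyak averaging (α = 0.01 online, α = 0.005 offline). A minimal sketch of that soft update, using plain NumPy arrays to stand in for network parameter tensors (the function name `polyak_update` is ours, not from the paper's code):

```python
import numpy as np

def polyak_update(target_params, online_params, alpha):
    """Soft target-network update: theta_tgt <- (1 - alpha) * theta_tgt + alpha * theta.

    Small alpha (e.g. 0.01 online, 0.005 offline) keeps the target network a
    slowly moving average of the online critic, stabilizing the TD targets.
    """
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(target_params, online_params)]

# Illustration: one update step with alpha = 0.01 moves the target 1% of the
# way toward the online parameters.
target = [np.zeros(4)]
online = [np.ones(4)]
target = polyak_update(target, online, alpha=0.01)
```

In a PyTorch training loop the same arithmetic would be applied in place to each pair of `parameters()` tensors after every critic gradient step.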