Offline Reinforcement Learning via Tsallis Regularization

Authors: Lingwei Zhu, Matthew Kyle Schlegel, Han Wang, Martha White

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. We evaluate Tsallis AWAC against several offline reinforcement learning methods. The goal of the experiments is to determine when the proposed method is best applied in offline RL, and to gain insight into the effect of the τ and q parameters (especially in regimes where there are no closed-form expressions for the policies). Finally, we evaluate whether the upper bound holds in the continuous-action setting. Domain details: we compare the proposed method against a number of existing algorithms on the standard D4RL benchmark environments (Fu et al., 2020). Specifically, we use three datasets from the MuJoCo suite in D4RL. Results are averaged over 5 runs, with the ribbon denoting the standard error.
Researcher Affiliation: Academia. Lingwei Zhu, Matthew Schlegel, Han Wang, and Martha White (Canada CIFAR AI Chair) are all affiliated with the Department of Computing Science, University of Alberta, and the Alberta Machine Intelligence Institute (Amii), Canada.
Pseudocode: No. The paper describes the proposed algorithms (Tsallis AWAC and Tsallis InAC) by outlining their loss functions and theoretical underpinnings in Sections 3, 4, and 5. However, it does not include a distinct, structured pseudocode block or algorithm figure.
Open Source Code: Yes. We propose a novel actor-critic algorithm, Tsallis Advantage Weighted Actor-Critic (Tsallis AWAC), generalizing AWAC (Nair et al., 2021), and analyze its performance in standard MuJoCo environments. Our code is available at https://github.com/lingweizhu/tsallis_regularization.
Open Datasets: Yes. We compare the proposed method against a number of existing algorithms on the standard D4RL benchmark environments (Fu et al., 2020).
Dataset Splits: No. The paper describes the D4RL datasets used (expert, medium-expert, medium-replay) and how they were collected, but it does not specify explicit training/validation/test splits for the experiments. The agents train on the specified datasets and are evaluated every 10k steps on the corresponding MuJoCo environments.
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory used to run the experiments; it only describes the experimental setup in terms of algorithms, datasets, and hyperparameters.
Software Dependencies: No. The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, or TensorFlow versions, or other libraries).
Experiment Setup: Yes. All the algorithms used a shared set of hyperparameters, found in Table 3. A grid search was done for Tsallis InAC, Tsallis AWAC, and InAC according to the same protocol as Xiao et al. (2023). In addition, we also added a larger learning rate (0.001), which seemed to improve InAC, Tsallis AWAC, and Tsallis InAC slightly on some domains. The best hyperparameters are reported in Table 4. The shared hyperparameters (Table 3) are:

Number of steps: 1,000,000
Logging interval: 10,000
Hidden units: 256
Batch size: 256
Target network update rate: 1
Polyak constant: 0.995
Discount (γ): 0.99
Learning rate: swept
Regularization coefficient (τ): swept
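The shared hyperparameters from Table 3 can be collected into a plain config, shown here as a sketch; the dict keys and the helper name `polyak_update` are illustrative choices, not names from the authors' repository:

```python
# Shared hyperparameters reported in Table 3 of the paper.
# Structure and key names are assumptions for illustration only.
SHARED_HPARAMS = {
    "num_steps": 1_000_000,
    "logging_interval": 10_000,       # evaluate every 10k steps
    "hidden_units": 256,
    "batch_size": 256,
    "target_network_update_rate": 1,
    "polyak_constant": 0.995,
    "discount_gamma": 0.99,
    "learning_rate": None,            # swept (0.001 also tried per the text)
    "regularization_tau": None,       # regularization coefficient τ, swept
}

def polyak_update(target_params, online_params,
                  polyak=SHARED_HPARAMS["polyak_constant"]):
    """Soft target-network update: target <- polyak*target + (1-polyak)*online."""
    return [polyak * t + (1.0 - polyak) * w
            for t, w in zip(target_params, online_params)]
```

A Polyak constant of 0.995 means each target-network parameter moves 0.5% of the way toward its online counterpart per update.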
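The Research Type row above notes that results are averaged over 5 runs with a ribbon showing the standard error. A minimal sketch of that aggregation (the function name and array layout are assumptions, not taken from the paper's code):

```python
import numpy as np

def aggregate_runs(returns):
    """Aggregate evaluation returns across independent runs.

    returns: array-like of shape (n_runs, n_eval_points), e.g. 5 runs
    each evaluated every 10k steps. Produces the per-point mean (the
    plotted curve) and the standard error std / sqrt(n_runs) (the ribbon).
    """
    returns = np.asarray(returns, dtype=float)
    n_runs = returns.shape[0]
    mean = returns.mean(axis=0)
    stderr = returns.std(axis=0, ddof=1) / np.sqrt(n_runs)
    return mean, stderr

# Example: 5 identical runs give a zero-width ribbon.
runs = [[1.0, 2.0, 3.0]] * 5
mean, stderr = aggregate_runs(runs)
```

Using `ddof=1` gives the sample standard deviation, which is the usual choice when estimating run-to-run variability from a small number of seeds.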