Mutual Information Regularized Offline Reinforcement Learning

Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication.
Researcher Affiliation | Industry | Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan (Sea AI Lab, EMAIL)
Pseudocode | Yes | Algorithm 1 (Mutual Information Regularized Offline RL). Input: initialize Q network Q_ϕ, policy network π_θ, dataset D, hyperparameters α_1 and α_2. For t ∈ {1, ..., MAX_STEP}: (1) train the Q network by gradient descent on the objective J_Q(ϕ) in Eqn. 12, ϕ := ϕ − η_Q ∇_ϕ J_Q(ϕ); (2) improve the policy network by gradient ascent on the objective J_π(θ) in Eqn. 13, θ := θ + η_π ∇_θ E_{s∼D, a∼π_θ(a|s)}[Q_ϕ(s, a)] + α_2 ∇_θ I_MISA. Output: the well-trained π_θ. (A JAX-style sketch of this loop appears after the table.)
Open Source Code | Yes | Our code is attached and will be released upon publication.
Open Datasets | Yes | The experiments use the publicly available D4RL benchmark (gym-locomotion-v2 and antmaze-v0 tasks), e.g., our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark. (A loading sketch appears after the table.)
Dataset Splits | No | The paper refers to using specific datasets (e.g., the D4RL benchmark, antmaze-v0, gym-locomotion-v2) but does not provide explicit train/validation/test splits by percentages or sample counts in the main text. It mentions 'average the mean returns over 10 evaluation trajectories and 5 random seeds' and 'evaluate the antmaze-v0 environments for 100 episodes instead', which implies evaluation on a held-out protocol, but no split details. (This protocol is sketched after the table.)
Hardware Specification | Yes | All experiments are conducted on NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions software like JAX [7] and Flax [19], and base RL algorithms like SAC [17], but does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17' or 'Flax 0.6.9').
Experiment Setup | Yes | We use the ELU activation function [11] and SAC [17] as the base RL algorithm. Besides, we use a learning rate of 1 × 10⁻⁴ for both the policy network and the Q-value network with a cosine learning rate scheduler. When approximating E_{π_θ(a|s)}[e^{T_ψ(s,a)}], we use 50 Monte-Carlo samples. (The Monte-Carlo term is sketched after the table.)
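
The Open Datasets row points to the publicly available D4RL benchmark. Below is a minimal loading sketch assuming the standard gym and d4rl packages; the environment id halfcheetah-medium-v2 is only an illustrative member of the gym-locomotion-v2 group and is not fixed by the summary.

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Illustrative gym-locomotion-v2 task; the summary does not name an exact dataset id.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # offline transitions as a dict of numpy arrays

# Standard D4RL keys: observations, actions, next_observations, rewards, terminals.
print({key: value.shape for key, value in dataset.items()})
```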
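The Pseudocode row describes one Q-network descent step and one policy ascent step per iteration. The following JAX/optax sketch mirrors that loop under stated assumptions: q_loss_fn (J_Q, Eqn. 12) and policy_objective_fn (J_π, Eqn. 13, already including the α_2 · I_MISA regularizer) are hypothetical callables supplied by the user, and MAX_STEP is assumed because the summary does not state the training length.

```python
import jax
import optax

LR = 1e-4             # learning rate from the Experiment Setup row
MAX_STEP = 1_000_000  # assumed; the summary does not state the number of steps

# Cosine learning-rate schedule for both networks, as described in the setup.
schedule = optax.cosine_decay_schedule(init_value=LR, decay_steps=MAX_STEP)
q_optimizer = optax.adam(schedule)
pi_optimizer = optax.adam(schedule)


def make_train_step(q_loss_fn, policy_objective_fn):
    """Build one MISA-style update: Q descent on J_Q, policy ascent on J_pi.

    Optimizer states are created once via q_optimizer.init(q_params) and
    pi_optimizer.init(pi_params) before the training loop.
    """

    @jax.jit
    def train_step(q_params, pi_params, q_opt_state, pi_opt_state, batch, rng):
        # Q-network update: phi := phi - eta_Q * grad_phi J_Q(phi)   (Eqn. 12)
        q_grads = jax.grad(q_loss_fn)(q_params, pi_params, batch, rng)
        q_updates, q_opt_state = q_optimizer.update(q_grads, q_opt_state, q_params)
        q_params = optax.apply_updates(q_params, q_updates)

        # Policy update: theta := theta + eta_pi * grad_theta J_pi(theta)  (Eqn. 13),
        # where J_pi = E_{s~D, a~pi_theta}[Q_phi(s, a)] + alpha_2 * I_MISA.
        # Gradient ascent is implemented as descent on the negated objective.
        pi_grads = jax.grad(
            lambda p: -policy_objective_fn(p, q_params, batch, rng)
        )(pi_params)
        pi_updates, pi_opt_state = pi_optimizer.update(pi_grads, pi_opt_state, pi_params)
        pi_params = optax.apply_updates(pi_params, pi_updates)
        return q_params, pi_params, q_opt_state, pi_opt_state

    return train_step
```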
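The Experiment Setup row mentions a 50-sample Monte-Carlo approximation of E_{π_θ(a|s)}[e^{T_ψ(s,a)}]. A sketch of that estimator follows; sample_action and t_network are hypothetical stand-ins for the policy sampler and the critic T_ψ, and the log-mean-exp form is used only for numerical stability.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

NUM_MC_SAMPLES = 50  # number of Monte-Carlo samples stated in the Experiment Setup row


def expected_exp_t(rng, pi_params, t_params, state, sample_action, t_network):
    """Estimate E_{a ~ pi_theta(.|s)}[exp(T_psi(s, a))] for a single state."""
    # Draw 50 actions a ~ pi_theta(a | s).
    rngs = jax.random.split(rng, NUM_MC_SAMPLES)
    actions = jax.vmap(lambda r: sample_action(pi_params, state, r))(rngs)

    # Evaluate T_psi(s, a) on each sample and average exp(T) in log space.
    t_values = jax.vmap(lambda a: t_network(t_params, state, a))(actions)
    log_mean_exp = logsumexp(t_values) - jnp.log(NUM_MC_SAMPLES)
    return jnp.exp(log_mean_exp)
```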
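The Dataset Splits row quotes the evaluation protocol: mean returns over 10 evaluation trajectories (100 episodes for antmaze-v0), averaged over 5 random seeds. A sketch of that aggregation, assuming D4RL's old-style gym step API and a hypothetical policy_fn plus a make_env_and_policy factory for trained agents:

```python
import numpy as np


def evaluate(env, policy_fn, num_episodes):
    """Mean normalized return over `num_episodes` evaluation trajectories."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            total += reward
        # D4RL environments expose get_normalized_score for the 0-100 scale.
        returns.append(env.get_normalized_score(total) * 100.0)
    return float(np.mean(returns))


def seed_averaged_score(make_env_and_policy, num_episodes=10, seeds=range(5)):
    """Average over 5 random seeds; antmaze-v0 uses num_episodes=100 instead."""
    scores = [evaluate(*make_env_and_policy(seed), num_episodes) for seed in seeds]
    return float(np.mean(scores))
```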