Mutual Information Regularized Offline Reinforcement Learning

Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication.
Researcher Affiliation | Industry | Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan (Sea AI Lab, EMAIL)
Pseudocode | Yes | Algorithm 1 (Mutual Information Regularized Offline RL). Input: initialize Q network Q_ϕ, policy network π_θ, dataset D, hyperparameters α_1 and α_2. For t ∈ {1, ..., MAX_STEP}: (1) train the Q network by gradient descent on the objective J_Q(ϕ) in Eqn. 12, ϕ := ϕ − η_Q ∇_ϕ J_Q(ϕ); (2) improve the policy network by gradient ascent on the objective J_π(θ) in Eqn. 13, θ := θ + η_π ∇_θ E_{s∼D, a∼π_θ(a|s)}[Q_ϕ(s, a)] + α_2 ∇_θ I_MISA. Output: the well-trained π_θ. (A JAX-style sketch of this loop appears after the table.)
Open Source Code | Yes | Our code is attached and will be released upon publication.
Open Datasets | Yes | The experiments use the publicly available D4RL benchmark (gym-locomotion-v2 and antmaze-v0 tasks), e.g., our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark. (A loading sketch appears after the table.)
Dataset Splits | No | The paper refers to using specific datasets (e.g., the D4RL benchmark, antmaze-v0, gym-locomotion-v2) but does not provide explicit train/validation/test splits by percentages or sample counts in the main text. It mentions 'average the mean returns over 10 evaluation trajectories and 5 random seeds' and 'evaluate the antmaze-v0 environments for 100 episodes instead', which implies evaluation on a held-out protocol, but no split details. (This protocol is sketched after the table.)
Hardware Specification | Yes | All experiments are conducted on NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions software like JAX [7] and Flax [19], and base RL algorithms like SAC [17], but does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17' or 'Flax 0.6.9').
Experiment Setup | Yes | We use the ELU activation function [11] and SAC [17] as the base RL algorithm. Besides, we use a learning rate of 1 × 10⁻⁴ for both the policy network and the Q-value network with a cosine learning rate scheduler. When approximating E_{π_θ(a|s)}[e^{T_ψ(s,a)}], we use 50 Monte-Carlo samples. (The Monte-Carlo term is sketched after the table.)
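
The Open Datasets row points to the publicly available D4RL benchmark. Below is a minimal loading sketch assuming the standard gym and d4rl packages; the environment id halfcheetah-medium-v2 is only an illustrative member of the gym-locomotion-v2 group and is not fixed by the summary.

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Illustrative gym-locomotion-v2 task; the summary does not name an exact dataset id.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # offline transitions as a dict of numpy arrays

# Standard D4RL keys: observations, actions, next_observations, rewards, terminals.
print({key: value.shape for key, value in dataset.items()})
```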
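The Pseudocode row describes one Q-network descent step and one policy ascent step per iteration. The following JAX/optax sketch mirrors that loop under stated assumptions: q_loss_fn (J_Q, Eqn. 12) and policy_objective_fn (J_π, Eqn. 13, already including the α_2 · I_MISA regularizer) are hypothetical callables supplied by the user, and MAX_STEP is assumed because the summary does not state the training length.

```python
import jax
import optax

LR = 1e-4             # learning rate from the Experiment Setup row
MAX_STEP = 1_000_000  # assumed; the summary does not state the number of steps

# Cosine learning-rate schedule for both networks, as described in the setup.
schedule = optax.cosine_decay_schedule(init_value=LR, decay_steps=MAX_STEP)
q_optimizer = optax.adam(schedule)
pi_optimizer = optax.adam(schedule)


def make_train_step(q_loss_fn, policy_objective_fn):
    """Build one MISA-style update: Q descent on J_Q, policy ascent on J_pi.

    Optimizer states are created once via q_optimizer.init(q_params) and
    pi_optimizer.init(pi_params) before the training loop.
    """

    @jax.jit
    def train_step(q_params, pi_params, q_opt_state, pi_opt_state, batch, rng):
        # Q-network update: phi := phi - eta_Q * grad_phi J_Q(phi)   (Eqn. 12)
        q_grads = jax.grad(q_loss_fn)(q_params, pi_params, batch, rng)
        q_updates, q_opt_state = q_optimizer.update(q_grads, q_opt_state, q_params)
        q_params = optax.apply_updates(q_params, q_updates)

        # Policy update: theta := theta + eta_pi * grad_theta J_pi(theta)  (Eqn. 13),
        # where J_pi = E_{s~D, a~pi_theta}[Q_phi(s, a)] + alpha_2 * I_MISA.
        # Gradient ascent is implemented as descent on the negated objective.
        pi_grads = jax.grad(
            lambda p: -policy_objective_fn(p, q_params, batch, rng)
        )(pi_params)
        pi_updates, pi_opt_state = pi_optimizer.update(pi_grads, pi_opt_state, pi_params)
        pi_params = optax.apply_updates(pi_params, pi_updates)
        return q_params, pi_params, q_opt_state, pi_opt_state

    return train_step
```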
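The Experiment Setup row mentions a 50-sample Monte-Carlo approximation of E_{π_θ(a|s)}[e^{T_ψ(s,a)}]. A sketch of that estimator follows; sample_action and t_network are hypothetical stand-ins for the policy sampler and the critic T_ψ, and the log-mean-exp form is used only for numerical stability.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

NUM_MC_SAMPLES = 50  # number of Monte-Carlo samples stated in the Experiment Setup row


def expected_exp_t(rng, pi_params, t_params, state, sample_action, t_network):
    """Estimate E_{a ~ pi_theta(.|s)}[exp(T_psi(s, a))] for a single state."""
    # Draw 50 actions a ~ pi_theta(a | s).
    rngs = jax.random.split(rng, NUM_MC_SAMPLES)
    actions = jax.vmap(lambda r: sample_action(pi_params, state, r))(rngs)

    # Evaluate T_psi(s, a) on each sample and average exp(T) in log space.
    t_values = jax.vmap(lambda a: t_network(t_params, state, a))(actions)
    log_mean_exp = logsumexp(t_values) - jnp.log(NUM_MC_SAMPLES)
    return jnp.exp(log_mean_exp)
```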
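The Dataset Splits row quotes the evaluation protocol: mean returns over 10 evaluation trajectories (100 episodes for antmaze-v0), averaged over 5 random seeds. A sketch of that aggregation, assuming D4RL's old-style gym step API and a hypothetical policy_fn plus a make_env_and_policy factory for trained agents:

```python
import numpy as np


def evaluate(env, policy_fn, num_episodes):
    """Mean normalized return over `num_episodes` evaluation trajectories."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            total += reward
        # D4RL environments expose get_normalized_score for the 0-100 scale.
        returns.append(env.get_normalized_score(total) * 100.0)
    return float(np.mean(returns))


def seed_averaged_score(make_env_and_policy, num_episodes=10, seeds=range(5)):
    """Average over 5 random seeds; antmaze-v0 uses num_episodes=100 instead."""
    scores = [evaluate(*make_env_and_policy(seed), num_episodes) for seed in seeds]
    return float(np.mean(scores))
```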