Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient

Authors: Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang, Vinny Cahill

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Drama on the Atari100k benchmark, demonstrating that it achieves performance comparable to other SOTA algorithms while using a world model with only 7 million trainable parameters. We present three ablation experiments to evaluate key components of Drama... In Table 1, the Normalised Mean refers to the average normalised score... As shown in Figure 4 in the appendix, Drama significantly outperforms DreamerV3 XS, achieving a normalised mean score of 105 compared to 37 and a normalised median score of 27 compared to 7, as presented in Table 3.
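The normalised mean and median above follow the usual Atari100k convention of human-normalised scores, (agent − random) / (human − random); the excerpt does not quote the formula, so this is the standard convention rather than the paper's stated definition. A minimal sketch with placeholder game scores (not the paper's results):

```python
def normalised_score(agent, random_play, human):
    """Human-normalised score: 0 = random play, 1 = human-level performance."""
    return (agent - random_play) / (human - random_play)

# Placeholder per-game raw scores (agent, random, human) -- illustrative only.
raw = {
    "GameA": (500.0, 100.0, 1000.0),
    "GameB": (30.0, 10.0, 50.0),
}

scores = sorted(normalised_score(a, r, h) for a, r, h in raw.values())
norm_mean = sum(scores) / len(scores)
mid = len(scores) // 2
norm_median = scores[mid] if len(scores) % 2 else (scores[mid - 1] + scores[mid]) / 2
```

The mean is sensitive to a few high-scoring games, which is why Atari100k papers typically report the median alongside it.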
Researcher Affiliation | Academia | Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang & Vinny Cahill, School of Computer Science and Statistics, Trinity College Dublin, the University of Dublin, College Green, Dublin 2, Ireland
Pseudocode | Yes | Algorithm 1: Training the world model and the behaviour policy
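Algorithm 1 is not reproduced in this excerpt; the Dreamer/STORM family it belongs to alternates world-model fitting, imagined rollouts, and actor-critic updates. The sketch below shows only that generic alternation with dummy stand-in components, not the paper's actual algorithm:

```python
import random

# Dummy stand-ins for the real components -- illustrative only.
class DummyWorldModel:
    def train_step(self, batch):
        return 0.0  # placeholder reconstruction/prediction loss

    def imagine(self, horizon):
        # Imagined rollout: (reward, done) pairs sampled from the model.
        return [(random.random(), False) for _ in range(horizon)]

class DummyActorCritic:
    def train_step(self, rollout):
        return 0.0  # placeholder actor-critic loss

replay = []
world_model, policy = DummyWorldModel(), DummyActorCritic()

for step in range(100):             # limited environment-interaction budget
    replay.append(random.random())  # 0) collect a transition (dummy)
    batch = replay[-16:]            #    sample a recent batch
    wm_loss = world_model.train_step(batch)    # 1) fit the world model
    rollout = world_model.imagine(horizon=16)  # 2) imagine trajectories
    ac_loss = policy.train_step(rollout)       # 3) train the behaviour policy
```

The key property this loop illustrates is that the policy is trained on imagined rollouts, so real environment steps are spent only on world-model data collection.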
Open Source Code | Yes | Our code is available at https://github.com/realwenlongwang/Drama.git.
Open Datasets | Yes | We evaluate the model using the Atari100k benchmark (Kaiser et al., 2020), which is widely used for assessing the sample efficiency of RL algorithms.
Dataset Splits | Yes | Atari100k limits interactions with the environment to 100,000 steps (equivalent to 400,000 frames with 4-frame skipping). For each game, we train Drama with 5 independent seeds and track training performance using a 5-episode running average, as recommended by Machado et al. (2018), a practice also followed in related work (Hafner et al., 2023).
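The 5-episode running average used to track training performance can be sketched with a fixed-size buffer; the episode returns below are illustrative, not real training data:

```python
from collections import deque

def running_average(episode_returns, window=5):
    """Mean of the last `window` episode returns, emitted after each episode."""
    buf = deque(maxlen=window)   # old returns fall out automatically
    out = []
    for r in episode_returns:
        buf.append(r)
        out.append(sum(buf) / len(buf))
    return out

# Illustrative episode returns -- placeholders, not Drama's results.
curve = running_average([10, 20, 30, 40, 50, 60, 70])
```

Early entries average over fewer than 5 episodes (the buffer is not yet full), which keeps the curve defined from the first episode onward.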
Hardware Specification | Yes | Drama is accessible and trainable on off-the-shelf hardware, such as a standard laptop. Experiments were conducted on a consumer-grade laptop with an NVIDIA RTX 2000 Ada Mobile GPU, ensuring practical relevance to resource-constrained settings.
Software Dependencies | No | The paper mentions being "implemented on top of the STORM infrastructure (Zhang et al., 2023)" and references an implementation using "pure PyTorch and MLX" for a specific experiment. However, it does not provide specific version numbers for key software components used for the main Drama implementation.
Experiment Setup | Yes | Appendix A.4 LOSS AND HYPERPARAMETERS details the hyperparameters for various components: A.4.1 VARIATIONAL AUTOENCODER (Table 5), A.4.2 MAMBA AND MAMBA-2 (Table 6), A.4.3 REWARD AND TERMINATION PREDICTION HEADS (Table 7), and A.4.4 ACTOR CRITIC HYPERPARAMETERS (Table 8). These tables specify values like learning rate, frame shape, layers, filters, stride, kernel, weight decay, activation functions, hidden state dimensions, gamma, lambda, entropy coefficient, max gradient norm, and batch sizes.
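The four hyperparameter groups listed (VAE, Mamba, prediction heads, actor-critic) can be organised as a nested configuration; the skeleton below mirrors that grouping, but every value is a placeholder, not taken from Tables 5-8 of the paper:

```python
# Illustrative configuration skeleton -- all values are placeholders,
# NOT the settings reported in the paper's Tables 5-8.
config = {
    "vae": {
        "learning_rate": 1e-4,
        "frame_shape": (64, 64, 3),
        "filters": 32,
        "stride": 2,
        "kernel": 3,
    },
    "mamba": {
        "layers": 2,
        "hidden_state_dim": 512,
        "weight_decay": 1e-2,
        "activation": "silu",
    },
    "heads": {
        "reward": {"layers": 1},
        "termination": {"layers": 1},
    },
    "actor_critic": {
        "gamma": 0.99,
        "lambda": 0.95,
        "entropy_coefficient": 3e-4,
        "max_grad_norm": 1.0,
        "batch_size": 32,
    },
}
```

Grouping hyperparameters per component like this makes the ablations described in the paper (swapping or disabling one component) a matter of editing one sub-dictionary.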