Partially Observable Reinforcement Learning with Memory Traces
Authors: Onno Eberhard, Michael Muehlebach, Claire Vernade
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control." ... (Section 6, Experiments) "We now evaluate the effectiveness of memory traces as a practical alternative to windows in online reinforcement learning. Our first experiment considers the setting of online policy evaluation by temporal difference learning with linear function approximation. In our second experiment, we test the potential of memory traces for deep reinforcement learning." |
| Researcher Affiliation | Academia | 1. Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2. University of Tübingen. Correspondence to: Onno Eberhard <EMAIL>. |
| Pseudocode | No | The paper describes methods narratively and mathematically, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementations of both TD learning and PPO, as well as the two environments (Sutton's noisy random walk and the T-maze), are available online at https://github.com/onnoeberhard/memory-traces. |
| Open Datasets | Yes | The environment that forms the basis for our first experiment is a modified version of Sutton's random walk (Sutton & Barto, 2018, Example 9.1). ... We construct a Minigrid (Chevalier-Boisvert et al., 2024) version of this environment, shown inset in Fig. 5 (with corridor length k = 8). |
| Dataset Splits | No | The paper describes using reinforcement learning environments (Sutton's noisy random walk and T-maze) for online learning. These environments involve continuous interaction rather than predefined dataset splits for training, validation, and testing. |
| Hardware Specification | No | The paper mentions running experiments with PPO and TD learning but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using Adam optimizer and libraries like xminigrid, but it does not specify version numbers for these software components or other key dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | The hyperparameters we used in our PPO experiments are compiled in Table 3. ... Our TD experiments ran for 100,000 steps and we performed a hyperparameter search for the step size α over a range of 13 logarithmically spaced values between 0.0001 and 1.0. ... keep the step size constant at α = 0.02 for all values of λ for the memory trace. Table 3 (PPO hyperparameters): total number of steps 1,024,000,000; number of parallel environments 16; number of steps per update 128; learning rate 0.0003; generalized advantage estimation λ 0.95; number of epochs 2; number of minibatches 8; clipping parameter ϵ 0.2; value loss weight 0.5; entropy coefficient 0.01; maximum gradient norm 0.5. |
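To make the reported TD setup concrete, the sketch below pairs a generic exponentially decaying memory trace with online TD(0) value prediction under linear function approximation, using the paper's reported step size α = 0.02. The trace update rule `z_t = λ·z_{t-1} + (1-λ)·o_t`, the toy environment, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def memory_trace(obs_seq, lam):
    """Exponentially decaying summary of past observations.

    One common formulation: z_t = lam * z_{t-1} + (1 - lam) * o_t.
    The paper's exact trace definition may differ.
    """
    z = np.zeros_like(obs_seq[0], dtype=float)
    traces = []
    for o in obs_seq:
        z = lam * z + (1.0 - lam) * o
        traces.append(z.copy())
    return traces

def td0_linear(features, rewards, alpha=0.02, gamma=0.99):
    """Online TD(0) value prediction with linear function approximation."""
    w = np.zeros(len(features[0]))
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # TD error: r + gamma * v(s') - v(s), with v(s) = w @ phi(s)
        delta = rewards[t] + gamma * w @ phi_next - w @ phi
        w += alpha * delta * phi
    return w

# Toy run: random binary observations stand in for a partially
# observable environment; the trace serves as the feature vector.
obs = [rng.integers(0, 2, size=4).astype(float) for _ in range(101)]
feats = memory_trace(obs, lam=0.9)
rews = rng.normal(size=100)
w = td0_linear(feats, rews)
print(w.shape)  # (4,)
```

The point of the trace is that, unlike a fixed-length observation window, it carries (exponentially discounted) information from the entire history at constant memory cost, which is what makes it usable as a drop-in feature for linear TD or as an input to a deep policy.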