Partially Observable Reinforcement Learning with Memory Traces

Authors: Onno Eberhard, Michael Muehlebach, Claire Vernade

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control." ... "6. Experiments: We now evaluate the effectiveness of memory traces as a practical alternative to windows in online reinforcement learning. Our first experiment considers the setting of online policy evaluation by temporal difference learning with linear function approximation. In our second experiment, we test the potential of memory traces for deep reinforcement learning."
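The experimental setting quoted above (TD learning with linear function approximation on a memory trace) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the memory trace is an exponential moving average of observations, z_t = λ·z_{t−1} + (1−λ)·o_t, and uses plain TD(0); the paper's exact trace normalization and update rule may differ, and the function names are hypothetical.

```python
import numpy as np

def memory_trace_update(z, obs, lam):
    """Assumed exponential-moving-average memory trace:
    z_t = lam * z_{t-1} + (1 - lam) * obs_t."""
    return lam * z + (1.0 - lam) * obs

def td_step(w, z, z_next, reward, gamma, alpha):
    """One TD(0) update with linear function approximation,
    using the memory trace z as the feature vector."""
    td_error = reward + gamma * (w @ z_next) - w @ z
    return w + alpha * td_error * z
```

A single interaction step would then update the trace with the new observation and apply `td_step` with the observed reward, mirroring the online policy-evaluation experiment described in the quote.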
Researcher Affiliation | Academia | "1Max Planck Institute for Intelligent Systems, Tübingen, Germany 2University of Tübingen. Correspondence to: Onno Eberhard <EMAIL>."
Pseudocode | No | The paper describes methods narratively and mathematically, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our implementations of both TD learning and PPO, as well as the two environments (Sutton's noisy random walk and the T-maze), are available online at https://github.com/onnoeberhard/memory-traces."
Open Datasets | Yes | "The environment that forms the basis for our first experiment is a modified version of Sutton's random walk (Sutton & Barto, 2018, Example 9.1)." ... "We construct a Minigrid (Chevalier-Boisvert et al., 2024) version of this environment, shown inset in Fig. 5 (with corridor length k = 8)."
Dataset Splits | No | The paper describes using reinforcement learning environments (Sutton's noisy random walk and the T-maze) for online learning. These environments involve continuous interaction rather than predefined training/validation/test dataset splits.
Hardware Specification | No | The paper mentions running experiments with PPO and TD learning but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for these experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and libraries such as xminigrid, but it does not specify version numbers for these software components or other key dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | "The hyperparameters we used in our PPO experiments are compiled in Table 3." ... "Our TD experiments ran for 100,000 steps and we performed a hyperparameter search for the step size α over a range of 13 logarithmically spaced values between 0.0001 and 1.0." ... "[We] keep the step size constant at α = 0.02 for all values of λ for the memory trace."

Table 3. PPO hyperparameters
  Total number of steps: 1,024,000,000
  Number of parallel environments: 16
  Number of steps per update: 128
  Learning rate: 0.0003
  Generalized advantage estimation λ: 0.95
  Number of epochs: 2
  Number of minibatches: 8
  Clipping parameter ϵ: 0.2
  Value loss weight: 0.5
  Entropy coefficient: 0.01
  Maximum gradient norm: 0.5
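The quoted setup is easy to express in code: the TD step-size grid of 13 logarithmically spaced values between 0.0001 and 1.0 is a standard log-space sweep, and the PPO hyperparameters from Table 3 can be collected in a config dictionary. This is an illustrative sketch only; the dictionary keys are hypothetical names, not identifiers from the authors' repository.

```python
import numpy as np

# TD hyperparameter search: 13 log-spaced step sizes in [1e-4, 1.0].
alphas = np.logspace(-4, 0, num=13)

# PPO hyperparameters as reported in Table 3 (keys are illustrative).
ppo_hparams = {
    "total_steps": 1_024_000_000,
    "num_envs": 16,
    "steps_per_update": 128,
    "learning_rate": 3e-4,
    "gae_lambda": 0.95,
    "num_epochs": 2,
    "num_minibatches": 8,
    "clip_epsilon": 0.2,
    "value_loss_weight": 0.5,
    "entropy_coef": 0.01,
    "max_grad_norm": 0.5,
}
```

Note that `np.logspace` takes exponents, so the endpoints −4 and 0 give step sizes from 10⁻⁴ to 10⁰ = 1.0 exactly as described.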