Partially Observable Reinforcement Learning with Memory Traces
Authors: Onno Eberhard, Michael Muehlebach, Claire Vernade
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control." ... (Section 6, Experiments) "We now evaluate the effectiveness of memory traces as a practical alternative to windows in online reinforcement learning. Our first experiment considers the setting of online policy evaluation by temporal difference learning with linear function approximation. In our second experiment, we test the potential of memory traces for deep reinforcement learning." |
| Researcher Affiliation | Academia | 1. Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2. University of Tübingen. Correspondence to: Onno Eberhard <EMAIL>. |
| Pseudocode | No | The paper describes methods narratively and mathematically, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementations of both TD learning and PPO, as well as the two environments (Sutton's noisy random walk and the T-maze), are available online at https://github.com/onnoeberhard/memory-traces. |
| Open Datasets | Yes | The environment that forms the basis for our first experiment is a modified version of Sutton's random walk (Sutton & Barto, 2018, Example 9.1). ... We construct a Minigrid (Chevalier-Boisvert et al., 2024) version of this environment, shown inset in Fig. 5 (with corridor length k = 8). |
| Dataset Splits | No | The paper describes using reinforcement learning environments (Sutton's noisy random walk and T-maze) for online learning. These environments involve continuous interaction rather than predefined dataset splits for training, validation, and testing. |
| Hardware Specification | No | The paper mentions running experiments with PPO and TD learning but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using Adam optimizer and libraries like xminigrid, but it does not specify version numbers for these software components or other key dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | The hyperparameters we used in our PPO experiments are compiled in Table 3. ... Our TD experiments ran for 100,000 steps and we performed a hyperparameter search for the step size α over a range of 13 logarithmically spaced values between 0.0001 and 1.0. ... keep the step size constant at α = 0.02 for all values of λ for the memory trace. Table 3 (PPO hyperparameters): total number of steps 1,024,000,000; number of parallel environments 16; number of steps per update 128; learning rate 0.0003; generalized advantage estimation λ 0.95; number of epochs 2; number of minibatches 8; clipping parameter ϵ 0.2; value loss weight 0.5; entropy coefficient 0.01; maximum gradient norm 0.5. |
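To make the reported TD setup concrete, the sketch below pairs a generic exponentially decaying memory trace with online TD(0) value prediction under linear function approximation, using the paper's reported step size α = 0.02. The trace update rule `z_t = λ·z_{t-1} + (1-λ)·o_t`, the toy environment, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def memory_trace(obs_seq, lam):
    """Exponentially decaying summary of past observations.

    One common formulation: z_t = lam * z_{t-1} + (1 - lam) * o_t.
    The paper's exact trace definition may differ.
    """
    z = np.zeros_like(obs_seq[0], dtype=float)
    traces = []
    for o in obs_seq:
        z = lam * z + (1.0 - lam) * o
        traces.append(z.copy())
    return traces

def td0_linear(features, rewards, alpha=0.02, gamma=0.99):
    """Online TD(0) value prediction with linear function approximation."""
    w = np.zeros(len(features[0]))
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # TD error: r + gamma * v(s') - v(s), with v(s) = w @ phi(s)
        delta = rewards[t] + gamma * w @ phi_next - w @ phi
        w += alpha * delta * phi
    return w

# Toy run: random binary observations stand in for a partially
# observable environment; the trace serves as the feature vector.
obs = [rng.integers(0, 2, size=4).astype(float) for _ in range(101)]
feats = memory_trace(obs, lam=0.9)
rews = rng.normal(size=100)
w = td0_linear(feats, rews)
print(w.shape)  # (4,)
```

The point of the trace is that, unlike a fixed-length observation window, it carries (exponentially discounted) information from the entire history at constant memory cost, which is what makes it usable as a drop-in feature for linear TD or as an input to a deep policy.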