Real-Time Recurrent Reinforcement Learning
Authors: Julian Lemmel, Radu Grosu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that the method is capable of solving a diverse set of partially observable reinforcement learning tasks. The algorithm we call real-time recurrent reinforcement learning (RTRRL) serves as a model of learning in biological neural networks, mimicking reward pathways in the basal ganglia. ... We evaluate the feasibility of our RTRRL approach by testing on RL benchmarks provided by the gymnax (Lange 2022), popgym (Morad et al. 2022) and brax (Freeman et al. 2021) packages. ... Figure 3: Bar-charts of combined normalized validation rewards achieved for 5 runs each on a range of different tasks. ... Ablation Experiments. |
| Researcher Affiliation | Collaboration | Julian Lemmel1,2, Radu Grosu1 1 Vienna University of Technology 2 Daten Vorsprung GmbH EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: RTRRL. Require: linear policy π_A(a\|h); linear value function v̂_C(h); recurrent layer RNN_R([o, a, r], h, Ĵ). 1: A, C, R ← initialize parameters; 2: B_A, B_C ← initialize feedback matrices; 3: h, e_A, e_C, e_R ← 0; 4: o ← reset environment; 5: h, Ĵ ← RNN_R([o, 0, 0], h, 0); 6: v ← v̂_C(h); 7: while not done do; 8: π ← π_A(h); 9: a ← sample(π); 10: o, r ← take action a; 11: h′, Ĵ′ ← RNN_R([o, a, r], h, Ĵ); 12: e_C ← γλ_C e_C + ∇_C v̂; 13: e_A ← γλ_A e_A + ∇_A log π[a]; 14: g_C ← B_C 1; 15: g_A ← B_A ∇_π log π[a]; 16: e_R ← γλ_R e_R + Ĵ(g_C + g_A); 17: v′ ← v̂_C(h′); 18: δ ← r + γv′ − v; 19: C ← C + η_C δ e_C; 20: A ← A + η_A δ e_A; 21: R ← R + η_R δ e_R; 22: v ← v′, h ← h′, Ĵ ← Ĵ′; 23: end while |
| Open Source Code | Yes | Code https://github.com/FranzKnut/RTRRL-AAAI25 |
| Open Datasets | Yes | We evaluate the feasibility of our RTRRL approach by testing on RL benchmarks provided by the gymnax (Lange 2022), popgym (Morad et al. 2022) and brax (Freeman et al. 2021) packages. |
| Dataset Splits | No | The paper evaluates on RL benchmarks/environments (gymnax, popgym, brax) which do not typically involve predefined train/test/validation dataset splits in the traditional sense. It mentions conducting experiments with '5 runs each' and '10 runs', but does not specify how a static dataset would be partitioned for training, validation, or testing for reproducibility. The context is continuous interaction with environments, not static dataset splitting. |
| Hardware Specification | No | Computational results have been achieved in part using the Vienna Scientific Cluster (VSC). No specific hardware details (e.g., GPU/CPU models, memory) of the cluster are provided. |
| Software Dependencies | No | Our implementation of PPO is based on purejaxrl (Lu et al. 2022). ... gymnax (Lange 2022), popgym (Morad et al. 2022) and brax (Freeman et al. 2021) packages. ... the adam (Kingma and Ba 2015) optimizer. The paper mentions software packages and an optimizer, along with citations, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | For each environment, we trained a network with 32 neurons for either a maximum of 50 million steps or until 20 subsequent epochs showed no improvement. The same set of hyperparameters, given in the Appendix, was used for all the RTRRL experiments if not stated otherwise. Importantly, a batch size of 1 was used to ensure biological plausibility. All λs and γ were kept at 0.99, H was set to 10⁻⁵, and the Adam (Kingma and Ba 2015) optimizer with a learning rate of 10⁻³ was used. |
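The pseudocode in the table above can be sketched as a single online loop: a linear softmax actor and linear critic read the hidden state of a small tanh RNN whose parameter Jacobian Ĵ = ∂h/∂R is carried forward RTRL-style, while fixed random feedback matrices (B_C, B_A) replace exact backpropagated gradients for the recurrent eligibility trace. The following is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the toy bandit-like environment, the dimensions, the learning rates, and the exact form of the feedback term are all illustrative (the paper uses 32 neurons, γ = λ = 0.99, and a learning rate of 10⁻³).

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_act, n_hid = 4, 2, 8
n_in = n_hid + n_obs + n_act + 1                 # RNN input: [h, o, a_onehot, r]

A = 0.1 * rng.standard_normal((n_act, n_hid))    # linear actor
C = 0.1 * rng.standard_normal(n_hid)             # linear critic
R = 0.1 * rng.standard_normal((n_hid, n_in))     # recurrent weights
B_C = 0.1 * rng.standard_normal(n_hid)           # fixed feedback, critic path
B_A = 0.1 * rng.standard_normal((n_hid, n_act))  # fixed feedback, actor path

gamma, lam, lr = 0.99, 0.99, 1e-3

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rnn_step(x, h, J):
    """Vanilla tanh RNN step plus exact RTRL update of J = dh/dR."""
    z = np.concatenate([h, x])
    h_new = np.tanh(R @ z)
    d = 1.0 - h_new ** 2                                  # tanh'
    rec = np.einsum('im,mjk->ijk', R[:, :n_hid], J)       # carried sensitivity
    direct = np.zeros_like(J)
    direct[np.arange(n_hid), np.arange(n_hid), :] = z     # immediate dependence
    return h_new, d[:, None, None] * (direct + rec)

def env_step(a):
    """Toy environment (illustrative): random observations, action 0 rewarded."""
    return rng.standard_normal(n_obs), float(a == 0)

h = np.zeros(n_hid)
J = np.zeros((n_hid, n_hid, n_in))
e_C, e_A, e_R = np.zeros_like(C), np.zeros_like(A), np.zeros_like(R)
o = rng.standard_normal(n_obs)

h, J = rnn_step(np.concatenate([o, np.zeros(n_act), [0.0]]), h, J)
v = C @ h
for _ in range(2000):
    pi = softmax(A @ h)
    a = rng.choice(n_act, p=pi)
    o, r = env_step(a)
    onehot = np.eye(n_act)[a]
    h_new, J_new = rnn_step(np.concatenate([o, onehot, [r]]), h, J)
    e_C = gamma * lam * e_C + h                           # grad_C v = h
    e_A = gamma * lam * e_A + np.outer(onehot - pi, h)    # grad_A log pi[a]
    g = B_C + B_A @ (onehot - pi)                         # feedback, not true grads
    e_R = gamma * lam * e_R + np.einsum('ijk,i->jk', J, g)
    v_new = C @ h_new
    delta = r + gamma * v_new - v                         # TD error
    C += lr * delta * e_C
    A += lr * delta * e_A
    R += lr * delta * e_R
    v, h, J = v_new, h_new, J_new

print(softmax(A @ h))                                     # final action probabilities
```

Note the biologically motivated constraints mirrored here: batch size 1 (one environment step per update), no backpropagation through time (the Jacobian is propagated forward), and random feedback in place of exact actor/critic gradients when building the recurrent eligibility trace.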