Handling Delay in Real-Time Reinforcement Learning
Authors: Ivan Anokhin, Rishav Rishav, Matt Riemer, Stephen Chung, Irina Rish, Samira Ebrahimi Kahou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four Mujoco tasks and all MinAtar games. Moreover, we demonstrate that parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computations paves the way for more efficient RL agents in real-time settings. ... Experiments confirm the importance of skip connections and history-augmented observation (see Fig. 1b), and our analysis shows that the skip connection offers a fast but less refined path for processing inputs, while the main connections provide a slower but more refined path. Our results show that in many environments this allows the policy in a parallel computation setting to achieve similar performance to an oracle agent with an instantaneous forward pass, provided the inference time of a layer is not large. ... We perform our main experiments on Mujoco (Todorov et al., 2012), MinAtar (Young & Tian, 2019) and MiniGrid (Chevalier-Boisvert et al., 2023) environments. |
| Researcher Affiliation | Collaboration | Ivan Anokhin¹², Rishav Rishav¹³, Matthew Riemer¹²⁴, Stephen Chung⁵, Irina Rish¹²⁶, Samira Ebrahimi Kahou¹³⁶ (¹Mila, ²Université de Montréal, ³University of Calgary, ⁴IBM Research, ⁵University of Cambridge, ⁶CIFAR AI Chair) |
| Pseudocode | Yes | Algorithm 1 Soft Actor-Critic Algorithm with parallel neuron computation. ... Algorithm 2 PPO with parallel neuron computation. |
| Open Source Code | Yes | Full code is available at https://github.com/avecplezir/realtime-agent. |
| Open Datasets | Yes | We perform our main experiments on Mujoco (Todorov et al., 2012), MinAtar (Young & Tian, 2019) and MiniGrid (Chevalier-Boisvert et al., 2023) environments. |
| Dataset Splits | No | The paper mentions training on Mujoco for 1 million steps and on MinAtar/MiniGrid for 10 million samples, with results averaged across 3 seeds. However, it does not specify any explicit training, validation, or test dataset splits, nor how data is partitioned within these environments for evaluation. |
| Hardware Specification | Yes | We evaluated the speed-up caused by parallel computations of neurons on various hardware platforms, observing significant improvements in inference time when utilizing a GPU. Fig. 1a illustrates the percentage improvement in inference speed as the number of layers increases across different hardware configurations. GPU. For the GPU setting we measured performance speed-up on a single A100 SXM4 GPU with 40 GB memory. ... CPU. We evaluated the benefits of parallelizing layers using C++ multi-threading on a CPU with 32 cores and 32 GB of RAM. |
| Software Dependencies | No | The paper mentions 'default Pytorch software' and 'Eigen C++ library' but does not specify their version numbers. |
| Experiment Setup | Yes | Table 7: Hyperparameters used in experiments. **SAC (Mujoco):** discount rate γ = 0.99; policy frequency = 2; target network frequency = 1; target smoothing coefficient = 0.005; policy learning rate = 3e-4; Q-function learning rate = 1e-3; optimizer = Adam (betas = (0.9, 0.999), epsilon = 1e-8); replay buffer size = 1,000,000; batch size = 256; learning starts = 10,000; entropy regularization = auto-tuned; target entropy scale = 1. **PPO (MinAtar and MiniGrid):** discount rate γ = 0.99; GAE lambda = 0.95; entropy coefficient = 0.01; value function coefficient = 0.5; normalize advantages = True; policy unroll steps = 32; number of environments = 32; update epochs = 4; learning rate = 2.5e-4 (annealed); optimizer = Adam (betas = (0.9, 0.999), epsilon = 1e-5); maximum gradient norm for clipping = 0.5. |
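The parallel neuron computation with a temporal skip connection quoted above can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions (the class name `ParallelSkipMLP`, layer sizes, random weights, and the placement of a single skip from observation to output are all illustrative), not the authors' implementation, which is in the linked repository:

```python
import numpy as np

rng = np.random.default_rng(0)

class ParallelSkipMLP:
    """Sketch: each layer consumes the *previous* step's activations, so all
    layers are mutually independent within a step and could be evaluated
    concurrently. The raw observation also reaches the output layer directly
    through a skip connection (the fast but less refined path), while the
    stacked layers provide the slower, more refined path."""

    def __init__(self, obs_dim: int, hidden: int, act_dim: int):
        self.w1 = rng.standard_normal((hidden, obs_dim)) * 0.1
        self.w2 = rng.standard_normal((hidden, hidden)) * 0.1
        # Output weights see both the slow path (h2) and the skip path (obs).
        self.w_out = rng.standard_normal((act_dim, hidden + obs_dim)) * 0.1
        self.h1 = np.zeros(hidden)  # stale activation of layer 1
        self.h2 = np.zeros(hidden)  # stale activation of layer 2

    def step(self, obs: np.ndarray) -> np.ndarray:
        # All three computations read only last step's values, so they have
        # no sequential dependency and could run on parallel workers.
        new_h1 = np.maximum(self.w1 @ obs, 0.0)
        new_h2 = np.maximum(self.w2 @ self.h1, 0.0)
        action = self.w_out @ np.concatenate([self.h2, obs])
        self.h1, self.h2 = new_h1, new_h2  # commit activations for next step
        return action
```

Without the skip connection, a new observation would need as many environment steps as there are layers before it influenced the action at all; the skip path lets it affect the very next action.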
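For reference, the Table 7 hyperparameters transcribed above can be collected into configuration dictionaries. The grouping and key names below are our own convenience, not the authors' code layout:

```python
# SAC hyperparameters for Mujoco (Table 7 of the paper).
SAC_MUJOCO = {
    "gamma": 0.99,                 # discount rate
    "policy_frequency": 2,
    "target_network_frequency": 1,
    "tau": 0.005,                  # target smoothing coefficient
    "policy_lr": 3e-4,
    "q_lr": 1e-3,
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),
    "adam_eps": 1e-8,
    "buffer_size": 1_000_000,
    "batch_size": 256,
    "learning_starts": 10_000,
    "autotune_entropy": True,      # entropy regularization auto-tuned
    "target_entropy_scale": 1,
}

# PPO hyperparameters for MinAtar and MiniGrid (Table 7 of the paper).
PPO_MINATAR_MINIGRID = {
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "ent_coef": 0.01,
    "vf_coef": 0.5,
    "norm_adv": True,
    "num_steps": 32,               # steps to unroll a policy
    "num_envs": 32,
    "update_epochs": 4,
    "learning_rate": 2.5e-4,
    "anneal_lr": True,
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),
    "adam_eps": 1e-5,
    "max_grad_norm": 0.5,
}
```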