ChainerRL: A Deep Reinforcement Learning Library
Authors: Yasuhiro Fujita, Prabhat Nagarajan, Toshiki Kataoka, Takahiro Ishikawa
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers' experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables qualitative inspection of trained agents. The entire Section 3, 'Reproducibility,' details experiments and comparisons against published results on Atari and MuJoCo benchmarks, providing extensive tables (Tables 2 and 4) of performance metrics. |
| Researcher Affiliation | Collaboration | Yasuhiro Fujita, Prabhat Nagarajan, and Toshiki Kataoka (Preferred Networks, Tokyo, Japan) and Takahiro Ishikawa (The University of Tokyo, Tokyo, Japan). The affiliations include a company (Preferred Networks) and a university (The University of Tokyo), indicating a collaboration. |
| Pseudocode | Yes | Appendix D. Pseudocode: The following pseudocode depicts the simplicity of creating and training a Rainbow agent with ChainerRL. `import chainerrl as crl; import gym; q_func = crl.q_functions.DistributionalDuelingDQN(...)  # dueling; crl.links.to_factorized_noisy(q_func)  # noisy networks; per = crl.replay_buffers.PrioritizedReplayBuffer(num_step_return=3, ...)  # prioritized replay buffer with a 3-step reward; rainbow = crl.agents.CategoricalDoubleDQN(per, q_func, ...)  # create a Rainbow agent; num_envs = 5  # train in five environments; env = crl.envs.MultiprocessVectorEnv([gym.make('Breakout') for _ in range(num_envs)]); crl.experiments.train_agent_batch_with_evaluation(rainbow, env, steps=...)  # train the agent and collect evaluation statistics` |
| Open Source Code | Yes | The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl. |
| Open Datasets | Yes | For the Atari benchmark (Bellemare et al., 2013), we have successfully reproduced DQN, IQN, Rainbow, and A3C. For the OpenAI Gym MuJoCo benchmark tasks, we have successfully reproduced DDPG, TRPO, PPO, TD3, and SAC. These are well-known and publicly available benchmark environments. |
| Dataset Splits | Yes | Table 3: Evaluation protocols used for the Atari reproductions. Eval Frequency (timesteps) 250K Eval Phase (timesteps) 125K Eval Episode Length (time) 5 min / 30 min Eval Episode Policy ϵ = 0.05 / ϵ = 0.001 / ϵ = 0.0 / N/A Reporting Protocol re-eval / best-eval. These details specify how the environment interactions are structured for evaluation, analogous to dataset splits in RL. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts) are mentioned for the experiments. The paper only refers to 'CPU and GPU training' generally. |
| Software Dependencies | No | The paper mentions 'Python and the Chainer deep learning framework' but does not provide specific version numbers for Python, Chainer, or other software dependencies like OpenAI Gym. |
| Experiment Setup | Yes | Table 3 provides detailed evaluation protocols for Atari reproductions, including 'Eval Frequency (timesteps) 250K', 'Eval Phase (timesteps) 125K', 'Eval Episode Length (time) 5 min', and 'Reporting Protocol re-eval'. The caption for Table 4 states: 'For DDPG and TD3, each ChainerRL score represents the maximum evaluation score during 1M-step training, averaged over 10 trials with different random seeds, where each evaluation phase of ten episodes is run after every 5000 steps. For PPO and TRPO, each ChainerRL score represents the final evaluation of 100 episodes after 2M-step training, averaged over 10 trials with different random seeds. For SAC, each ChainerRL score reports the final evaluation of 10 episodes after training for 1M (Hopper-v2), 3M (HalfCheetah-v2, Walker2d-v2, and Ant-v2), or 10M (Humanoid-v2) steps, averaged over 10 trials with different random seeds.' |
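The evaluation protocols quoted above imply a fixed evaluation budget per trial. As a minimal sketch (the helper function below is our own illustration, not part of ChainerRL's API), the DDPG/TD3 protocol from Table 4 works out as follows:

```python
def num_eval_phases(total_steps: int, eval_interval: int) -> int:
    """Number of evaluation phases when evaluating every `eval_interval` steps."""
    return total_steps // eval_interval

# DDPG/TD3 protocol from Table 4: 1M training steps, an evaluation phase
# of ten episodes after every 5000 steps.
phases = num_eval_phases(1_000_000, 5_000)
episodes_per_phase = 10
total_eval_episodes = phases * episodes_per_phase

print(phases)               # 200 evaluation phases per trial
print(total_eval_episodes)  # 2000 evaluation episodes per trial
```

Across the 10 trials with different random seeds, this protocol therefore reports the maximum over 200 evaluation scores per trial, each itself an average over ten episodes.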