Natural Value Approximators: Learning when to Trust Past Estimates
Authors: Zhongwen Xu, Joseph Modayil, Hado van Hasselt, Andre Barreto, David Silver, Tom Schaul
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm. ... In this section, we integrate our method within A3C (Asynchronous advantage actor-critic [9]), ... We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games. |
| Researcher Affiliation | Industry | Zhongwen Xu DeepMind EMAIL Joseph Modayil DeepMind EMAIL Hado van Hasselt DeepMind EMAIL Andre Barreto DeepMind EMAIL David Silver DeepMind EMAIL Tom Schaul DeepMind EMAIL |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements or links indicating that open-source code for their method is available. |
| Open Datasets | Yes | We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games. |
| Dataset Splits | No | The paper describes evaluation metrics and conditions ('human starts' and 'no-op starts') but does not specify traditional train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | We train agents for 80 Million agent steps (320 Million Atari game frames) on a single machine with 16 cores. This mentions the number of CPU cores but lacks specific CPU or GPU models, memory details, or other hardware specifications. |
| Software Dependencies | No | The paper mentions 'A3C algorithm' and 'Adam' optimizer but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | The network architecture is composed of three layers of convolutions, followed by a fully connected layer with output h, which feeds into the two separate heads (π with an additional softmax, and a scalar v...). The updates are done online with a buffer of the past 20-state transitions. The value targets are n-step targets Z^n_t... We train agents for 80 Million agent steps (320 Million Atari game frames)... we set k to 50. The networks are trained for 5000 steps using Adam [5] with minibatch size 32. |
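The "Experiment Setup" row quotes the paper's use of n-step value targets Z^n_t computed over a buffer of the past 20 transitions, as is standard in A3C-style training. A minimal sketch of that target computation in plain Python (the function name `n_step_targets` and the signature are our own illustration, not from the paper):

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute n-step value targets for a buffer of transitions.

    Given rewards r_0..r_{T-1} from the buffer and a bootstrap value
    v(s_T) for the state after the buffer, the target for step t is
        Z_t = r_t + gamma * r_{t+1} + ... + gamma^(T-1-t) * r_{T-1}
              + gamma^(T-t) * v(s_T),
    computed with a single backward pass over the buffer.
    """
    targets = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running  # fold in one more step of discounting
        targets.append(running)
    targets.reverse()  # restore chronological order
    return targets
```

For example, with rewards `[1.0, 0.0, 2.0]`, a bootstrap value of `10.0`, and `gamma=0.5`, the target for the first step is 1 + 0.5·0 + 0.25·2 + 0.125·10 = 2.75.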