Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
Authors: Jonathan Cook, Chris Lu, Edward Hughes, Joel Z. Leibo, Jakob Foerster
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of both the in-context and in-weights models by showing sustained generational performance gains on several tasks requiring exploration under partial observability. On each task, we find that accumulating agents outperform those that learn for a single lifetime of the same total experience budget. |
| Researcher Affiliation | Collaboration | Jonathan Cook (FLAIR, University of Oxford, EMAIL); Chris Lu (FLAIR, University of Oxford, EMAIL); Edward Hughes (Google DeepMind, EMAIL); Joel Z. Leibo (Google DeepMind, EMAIL); Jakob Foerster (FLAIR, University of Oxford, EMAIL) |
| Pseudocode | Yes | Algorithm 1 Training Loop for In-Context Accumulation (changes to RL2 in red); Algorithm 2 In-Context Accumulation During Evaluation |
| Open Source Code | Yes | Code can be found at https://github.com/FLAIROx/cultural-accumulation. |
| Open Datasets | No | The paper introduces custom environments (Memory Sequence, Goal Sequence, Travelling Salesperson), which are released as part of the open-source code, but it does not provide access information for any pre-existing, static public dataset. |
| Dataset Splits | No | The paper describes training and testing on environment instances but does not explicitly provide details about a separate validation dataset split. |
| Hardware Specification | Yes | Memory Sequence and TSP experiments were run on a single NVIDIA RTX A40 GPU (40GB memory)... Training of in-context learners in Goal Sequence was run in under 8 minutes on 4 A40s... In-weights accumulation in Goal Sequence was run in 30 minutes on 4 A40s. |
| Software Dependencies | No | The paper mentions software components such as the PureJaxRL codebase, PPO, and S5, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Appendix F Hyperparameters: population size, learning rate, batch size, rollout length, update epochs, minibatches, γ, λGAE, ϵ clip, entropy coefficient, value coefficient, max gradient norm, anneal learning rate are specified for Memory Sequence, TSP, and Goal Sequence experiments. |
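To illustrate the generational structure referenced in the Pseudocode row (Algorithm 2, in-context accumulation during evaluation), the sketch below is a heavily simplified toy rendering, not the paper's method: a hidden action sequence stands in for the partially observable task, a lookup table stands in for the RL² agent, and all names (`rollout`, `accumulate_generations`, `ToyEnv`-style arguments) are hypothetical. Each generation conditions on the previous generation's trajectory, keeping actions that earned reward and exploring the rest.

```python
import random

def rollout(hidden_seq, known_actions, rng, n_actions=3):
    """One lifetime: replay actions known to be correct, explore the rest.

    Returns a trajectory of (position, action, reward) triples, with
    reward 1 for matching the hidden target at that position, else 0.
    """
    traj = []
    for i, target in enumerate(hidden_seq):
        a = known_actions.get(i)
        if a is None:
            a = rng.randrange(n_actions)  # explore unknown positions
        r = 1 if a == target else 0
        traj.append((i, a, r))
    return traj

def accumulate_generations(hidden_seq, n_generations, seed=0, n_actions=3):
    """Toy in-context accumulation: each generation observes its
    predecessor's trajectory and copies only the rewarded actions."""
    rng = random.Random(seed)
    known = {}  # position -> action confirmed correct by a predecessor
    scores = []
    for _ in range(n_generations):
        traj = rollout(hidden_seq, known, rng, n_actions)
        scores.append(sum(r for _, _, r in traj))
        known = {i: a for i, a, r in traj if r == 1}  # pass on what worked
    return scores
```

Because rewarded actions are always passed forward, per-generation scores are non-decreasing, which mirrors (in caricature) the paper's claim that accumulating agents outperform a single lifetime with the same total experience budget.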