Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning

Authors: Jonathan Cook, Chris Lu, Edward Hughes, Joel Z. Leibo, Jakob Foerster

NeurIPS 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response.
Research Type: Experimental
LLM Response: We demonstrate the effectiveness of both the in-context and in-weights models by showing sustained generational performance gains on several tasks requiring exploration under partial observability. On each task, we find that accumulating agents outperform those that learn for a single lifetime of the same total experience budget.
Researcher Affiliation: Collaboration
LLM Response: Jonathan Cook (FLAIR, University of Oxford), Chris Lu (FLAIR, University of Oxford), Edward Hughes (Google DeepMind), Joel Z. Leibo (Google DeepMind), Jakob Foerster (FLAIR, University of Oxford)
Pseudocode: Yes
LLM Response: Algorithm 1, Training Loop for In-Context Accumulation (changes to RL² in red); Algorithm 2, In-Context Accumulation During Evaluation
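The generational handover that Algorithm 2 describes (each new agent conditioning on the previous generation's behaviour) can be illustrated with a deliberately toy sketch. Everything here (the guess-the-goal task, `run_generation`, and the context handover) is a hypothetical stand-in, not the authors' implementation:

```python
import random

def run_generation(policy_context, num_steps=20, seed=0):
    """One lifetime in a toy 'find the hidden goal' task. The agent starts
    from whatever knowledge was handed down in context; otherwise it
    explores by guessing randomly."""
    rng = random.Random(seed)
    goal = 7                  # hidden goal the lineage is trying to discover
    best = policy_context     # knowledge inherited from the previous generation
    reward = 0
    for _ in range(num_steps):
        guess = best if best is not None else rng.randint(0, 9)
        if guess == goal:
            reward += 1
            best = guess      # retain the discovery for the rest of the lifetime
        else:
            best = None       # a failed guess yields no reusable knowledge
    return reward, best

def in_context_accumulation(num_generations=5):
    """Each generation observes the previous one's final behaviour,
    loosely analogous to the social-context handover in Algorithm 2."""
    context, rewards = None, []
    for g in range(num_generations):
        reward, context = run_generation(context, seed=g)
        rewards.append(reward)
    return rewards
```

Once any generation discovers the goal, every later generation exploits it from its first step, so per-generation reward is non-decreasing across the lineage, mirroring the qualitative claim that accumulating agents outperform a single lifetime of exploration.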
Open Source Code: Yes
LLM Response: Code can be found at https://github.com/FLAIROx/cultural-accumulation.
Open Datasets: No
LLM Response: The paper introduces custom environments (Memory Sequence, Goal Sequence, Travelling Salesperson), released as part of the open-source code, but does not provide concrete access information for a pre-existing, static public dataset.
Dataset Splits: No
LLM Response: The paper describes training and testing on environment instances but does not explicitly describe a separate validation split.
Hardware Specification: Yes
LLM Response: Memory Sequence and TSP experiments were run on a single NVIDIA RTX A40 GPU (40GB memory)... Training of in-context learners in Goal Sequence was run in under 8 minutes on 4 A40s... In-weights accumulation in Goal Sequence was run in 30 minutes on 4 A40s.
Software Dependencies: No
LLM Response: The paper mentions software components such as the PureJaxRL codebase, PPO, and S5, but does not provide specific version numbers for these dependencies.
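Since the paper does not pin versions, anyone attempting a reproduction may want to record the versions actually installed in their environment. A minimal sketch using the standard library; the package list passed in at the end is an illustrative guess at the JAX stack such a codebase might use, not a list taken from the paper:

```python
import importlib.metadata as md

def report_versions(packages):
    """Map each distribution name to its installed version, or a marker
    string when the package is absent from the environment."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

# Hypothetical package list for a JAX-based RL codebase
print(report_versions(["jax", "flax", "optax"]))
```

Logging this dictionary alongside experiment results makes the software environment reconstructible even when upstream code leaves dependencies unpinned.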
Experiment Setup: Yes
LLM Response: Appendix F (Hyperparameters) specifies population size, learning rate, batch size, rollout length, update epochs, minibatches, γ, λ_GAE, clipping ϵ, entropy coefficient, value coefficient, max gradient norm, and learning-rate annealing for the Memory Sequence, TSP, and Goal Sequence experiments.
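The knobs listed above map naturally onto a PPO configuration object. A sketch of what such a config might look like; the field names mirror the Appendix F list, but the default values below are common PPO choices, not the paper's reported settings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Illustrative PPO hyperparameter container; defaults are
    conventional placeholders, NOT the values from Appendix F."""
    population_size: int = 8
    learning_rate: float = 3e-4
    batch_size: int = 1024
    rollout_length: int = 128
    update_epochs: int = 4
    num_minibatches: int = 4
    gamma: float = 0.99          # discount factor γ
    gae_lambda: float = 0.95     # λ_GAE for generalized advantage estimation
    clip_eps: float = 0.2        # PPO clipping parameter ϵ
    entropy_coef: float = 0.01
    value_coef: float = 0.5
    max_grad_norm: float = 0.5
    anneal_lr: bool = True

# Per-task overrides, e.g. a hypothetical Goal Sequence run:
goal_sequence_config = PPOConfig(rollout_length=256, anneal_lr=False)
```

A frozen dataclass like this keeps every reported hyperparameter in one auditable place, which is exactly what makes the paper's Appendix F table reproducible.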