Adaptive patch foraging in deep reinforcement learning agents
Authors: Nathan Wispinski, Andrew Butcher, Kory Wallace Mathewson, Craig S. Chapman, Matthew Botvinick, Patrick M. Pilarski
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we investigate deep reinforcement learning agents in an ecological patch foraging task. For the first time, we show that machine learning agents can learn to patch forage adaptively in patterns similar to biological foragers, and approach optimal patch foraging behavior when accounting for temporal discounting. Finally, we show emergent internal dynamics in these agents that resemble single-cell recordings from foraging non-human primates, which complements experimental and theoretical work on the neural mechanisms of biological foraging. This work suggests that agents interacting in complex environments with ecologically valid pressures arrive at common solutions, suggesting the emergence of foundational computations behind adaptive, intelligent behavior in both biological and artificial agents. (Section 2, Experiments:) A continuous 3D environment was selected to approximate the rich sensorimotor experience involved in ecological foraging experiments. |
| Researcher Affiliation | Collaboration | Nathan J. Wispinski EMAIL University of Alberta, Edmonton, Canada (work conducted while an intern with DeepMind, Edmonton, Canada) Andrew Butcher EMAIL DeepMind, Edmonton, Canada Kory W. Mathewson EMAIL DeepMind, Montreal, Canada Craig S. Chapman EMAIL University of Alberta, Edmonton, Canada Matthew M. Botvinick EMAIL DeepMind, London, UK Patrick M. Pilarski EMAIL DeepMind, Edmonton, Canada; University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute (Amii), Edmonton, Canada |
| Pseudocode | No | The paper describes the Maximum A Posteriori Policy Optimization (MPO) algorithm as the training method but does not provide pseudocode or an algorithm block for it. The descriptions of the environment, agent architecture, and training process are given in paragraph form. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code for the methodology described, nor does it include a link to a code repository. It mentions videos in supplementary material, and refers to previous work for architecture and hyperparameters, but not code release for this specific paper. |
| Open Datasets | No | The paper describes a custom-built 3D simulation environment and generates data from it, but does not provide concrete access information (link, DOI, or repository) for this environment or the datasets used/generated in the experiments. It refers to 'Cultural General Intelligence Team et al., 2022' for environment details, but this is a citation, not access to a dataset. |
| Dataset Splits | No | The paper describes how agents were trained and evaluated (e.g., 'Trained agents were evaluated on 50 episodes of each evaluation patch distance (i.e., 6, 8, 10, and 12 m)'). However, it does not provide specific training/test/validation dataset splits with exact percentages, sample counts, or predefined citations that would allow reproduction of data partitioning. |
| Hardware Specification | Yes | Each agent was trained on an internal cluster for roughly 13 days, and used approximately 40 GiB RAM, 8 CPUs, and 8 GPUs. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and the 'maximum a posteriori policy optimization (MPO)' algorithm. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or other dependencies. |
| Experiment Setup | Yes | Three agents were trained in each of four discount rate treatments (N = 12), selected on the basis of MVT simulations (Figure 3d). Agents were each initialized with a different random seed, and trained for 12e7 steps using the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 3e-4. On each training episode, patch distance was drawn from a random uniform distribution between 5 m and 12 m, and held constant for each episode. For all experiments, the initial patch reward, N0, was set to 1/30, and the patch reward decay rate, λ, was set to 0.01 (Figure 1c). Each episode terminated after 3600 steps. |
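The Experiment Setup row fixes the environment's reward constants (N0 = 1/30, λ = 0.01) but the table does not reproduce the reward function itself. A minimal sketch, assuming the standard exponential-decay form used in patch-foraging models (the functional form and the helper names `patch_reward` / `discounted_patch_value` are our assumptions, not code from the paper):

```python
import math

def patch_reward(t, n0=1/30, lam=0.01):
    """Reward harvested at in-patch step t, assuming exponential decay
    from the initial reward n0 at the paper's decay rate lam."""
    return n0 * math.exp(-lam * t)

def discounted_patch_value(horizon, gamma, n0=1/30, lam=0.01):
    """Temporally discounted value of staying in a patch for `horizon`
    more steps, given discount rate gamma (the treatment variable the
    paper varies across agents)."""
    return sum(gamma ** t * patch_reward(t, n0, lam) for t in range(horizon))
```

Under this sketch, lower discount rates (smaller gamma) shrink the value of long patch stays, which is consistent with the paper's claim that agents approach MVT-optimal leaving times once temporal discounting is accounted for.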