Adaptive patch foraging in deep reinforcement learning agents

Authors: Nathan Wispinski, Andrew Butcher, Kory Wallace Mathewson, Craig S. Chapman, Matthew Botvinick, Patrick M. Pilarski

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we investigate deep reinforcement learning agents in an ecological patch foraging task. For the first time, we show that machine learning agents can learn to patch forage adaptively in patterns similar to biological foragers, and approach optimal patch foraging behavior when accounting for temporal discounting. Finally, we show emergent internal dynamics in these agents that resemble single-cell recordings from foraging non-human primates, which complements experimental and theoretical work on the neural mechanisms of biological foraging. This work suggests that agents interacting in complex environments with ecologically valid pressures arrive at common solutions, suggesting the emergence of foundational computations behind adaptive, intelligent behavior in both biological and artificial agents. From Section 2 (Experiments): A continuous 3D environment was selected to approximate the rich sensorimotor experience involved in ecological foraging experiments.
Researcher Affiliation | Collaboration | Nathan J. Wispinski (EMAIL), University of Alberta, Edmonton, Canada (work conducted while an intern with DeepMind, Edmonton, Canada); Andrew Butcher (EMAIL), DeepMind, Edmonton, Canada; Kory W. Mathewson (EMAIL), DeepMind, Montreal, Canada; Craig S. Chapman (EMAIL), University of Alberta, Edmonton, Canada; Matthew M. Botvinick (EMAIL), DeepMind, London, UK; Patrick M. Pilarski (EMAIL), DeepMind, Edmonton, Canada; University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute (Amii), Edmonton, Canada
Pseudocode | No | The paper describes the maximum a posteriori policy optimization (MPO) algorithm as the training method but does not provide pseudocode or an algorithm block for it. The environment, agent architecture, and training process are described in paragraph form.
Open Source Code | No | The paper does not explicitly state that code for the described methodology is released, nor does it link to a code repository. It mentions videos in the supplementary material and refers to previous work for architecture and hyperparameters, but there is no code release for this specific paper.
Open Datasets | No | The paper describes a custom-built 3D simulation environment and generates data from it, but it provides no concrete access information (link, DOI, or repository) for the environment or for the datasets used or generated in the experiments. It refers to Cultural General Intelligence Team et al. (2022) for environment details, but this is a citation, not access to a dataset.
Dataset Splits | No | The paper describes how agents were trained and evaluated (e.g., "Trained agents were evaluated on 50 episodes of each evaluation patch distance (i.e., 6, 8, 10, and 12 m)"), but it does not provide training/validation/test splits with exact percentages, sample counts, or citations to predefined splits that would allow the data partitioning to be reproduced.
Hardware Specification | Yes | Each agent was trained on an internal cluster for roughly 13 days, using approximately 40 GiB of RAM, 8 CPUs, and 8 GPUs.
Software Dependencies | No | The paper mentions the Adam optimizer and the maximum a posteriori policy optimization (MPO) algorithm, but it does not provide version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or other dependencies.
Experiment Setup | Yes | Three agents were trained in each of four discount-rate treatments (N = 12), selected on the basis of MVT simulations (Figure 3d). Each agent was initialized with a different random seed and trained for 12e7 steps using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 3e-4. On each training episode, the patch distance was drawn from a uniform random distribution between 5 m and 12 m and held constant for that episode. For all experiments, the initial patch reward, N0, was set to 1/30, and the patch reward decay rate, λ, was set to 0.01 (Figure 1c). Each episode terminated after 3600 steps.
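The patch parameters stated above (N0 = 1/30, λ = 0.01, 3600-step episodes) are sufficient to reproduce the marginal value theorem (MVT) baseline the review refers to. Below is a minimal sketch, assuming a per-step patch reward of N0·e^(−λt) and the standard MVT rule (leave the patch when its instantaneous reward rate falls to the environment's average reward rate over a full travel-plus-harvest cycle); the function names, grid-search approach, and travel-time values are illustrative and not taken from the paper:

```python
import numpy as np

# Patch parameters as reported in the review (Figure 1c of the paper).
N0 = 1.0 / 30.0   # initial per-step patch reward
LAM = 0.01        # per-step exponential decay rate of patch reward

def reward_at(t):
    """Instantaneous per-step reward after t steps spent in a patch."""
    return N0 * np.exp(-LAM * t)

def avg_rate(residence, travel):
    """Average reward rate over one cycle: travel to patch, then harvest."""
    # Total reward harvested = integral of N0*exp(-LAM*t) from 0 to residence.
    harvested = (N0 / LAM) * (1.0 - np.exp(-LAM * residence))
    return harvested / (residence + travel)

def mvt_residence(travel, t_max=3600):
    """Residence time maximizing average rate (grid search, capped at episode length)."""
    ts = np.arange(1, t_max)
    rates = [avg_rate(t, travel) for t in ts]
    return int(ts[np.argmax(rates)])

# MVT predicts longer patch residence when patches are farther apart.
print(mvt_residence(travel=50), mvt_residence(travel=150))
```

At the rate-maximizing residence time, the instantaneous reward `reward_at(t)` equals the cycle-average rate, which is exactly the MVT leaving rule; the qualitative prediction (longer travel time, longer residence) matches the adaptive behavior the review says the trained agents exhibit.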