A Generalist Agent
Authors: Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this report we describe the model and the data, and document the current capabilities of Gato. ... With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more. ... In this section, we summarize the performance of Gato when trained on the above described data. That is, all results across all tasks are derived from a single pretrained model with a single set of weights. Results with fine-tuning will be presented in Section 5. ... Figure 5: Gato’s performance on simulated control tasks. Number of tasks where the performance of the pretrained model is above a percentage of expert score, grouped by domain. |
| Researcher Affiliation | Industry | Scott Reed*, Konrad Żołna*, Emilio Parisotto*, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar and Nando de Freitas *Equal contributions, Equal senior contributions, All authors are affiliated with DeepMind |
| Pseudocode | No | The paper describes the tokenization, network architecture, loss function, and deployment processes of Gato using descriptive text and figures (e.g., Figure 2: Training phase of Gato, Figure 3: Running Gato as a control policy) but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the source code for the methodology described is publicly available, nor does it provide a direct link to a code repository. The Model card in Appendix A states: "Out-of-Scope Uses Not intended for commercial or production use.". |
| Open Datasets | Yes | Gato is trained on a large number of datasets comprising agent experience in both simulated and real world environments, as well as a variety of natural language and image datasets. The datasets we use and their attributes are listed in Table 1. ... The simulated environments include Meta-World (Yu et al., 2020) introduced to benchmark meta-reinforcement learning and multi-task learning, Sokoban (Racanière et al., 2017) proposed as a planning problem, BabyAI (Chevalier-Boisvert et al., 2018) for language instruction following in grid-worlds, the DM Control Suite (Tunyasuvunakool et al., 2020) for continuous control, as well as DM Lab (Beattie et al., 2016) ... We also use the Arcade Learning Environment (Bellemare et al., 2013) ... We also include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020). ... Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets... We also included several vision-language datasets in Gato’s training. ALIGN (Jia et al., 2021) consists of 1.8B images... LTIP (Long Text & Image Pairs)... Conceptual captions (Sharma et al., 2018) and COCO captions (Chen et al., 2015)... The MultiModal MassiveWeb (M3W) dataset (Alayrac et al., 2022)... visual question-answering datasets. In particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015)... |
| Dataset Splits | Yes | Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, Massive Web, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details). ... For this reason, we held-out all data for four tasks from our pre-training set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain). ... For the fine-tuning datasets... We randomly took 1000 episodes (out of 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of cascading subsets for each task. |
| Hardware Specification | Yes | Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1024, which takes about 4 days. ... We found that the 1.18B parameter model was able to run on the hardware accelerators in our robots (NVIDIA GeForce RTX 3090s), but still overran the 20Hz control rate by a small amount (~0.01 seconds). |
| Software Dependencies | No | The paper mentions several tools, environments, and models such as SentencePiece, ViT, ResNet, Transformer, Adam W, Adam, Group Norm, GELU, MuJoCo, and OpenAI Gym, but it does not specify exact version numbers for these software components or libraries. |
| Experiment Setup | Yes | Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196 (more details in Section C.1). ... Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1024, which takes about 4 days. ... For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warmup and cosine schedule decay. The linear warmup lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a different maximum learning rate depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps. The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. ... We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1. ... For fine-tuning, we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. ... We use a batch size of 64 and a sequence length of 1024 tokens for all models. We train for 10,000 gradient steps. Regularization: We use dropout (Srivastava et al., 2014) with a rate of 0.1. |
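The pretraining schedule quoted under Experiment Setup (linear warmup for 15,000 steps from 1e-7 to a model-dependent peak, then cosine decay by a factor of 10x over 1,000,000 steps) can be sketched as follows. This is a minimal illustration, not the authors' code; the peak rate `max_lr=2e-4` is an arbitrary placeholder, since the paper says the actual maximum depends on the model (its Table 6).

```python
import math

WARMUP_STEPS = 15_000   # from the quoted setup
TOTAL_STEPS = 1_000_000  # from the quoted setup
START_LR = 1e-7          # warmup starting learning rate

def learning_rate(step, max_lr=2e-4):
    """Linear warmup to max_lr, then cosine decay to max_lr / 10."""
    if step < WARMUP_STEPS:
        # Linear interpolation from START_LR up to max_lr.
        frac = step / WARMUP_STEPS
        return START_LR + frac * (max_lr - START_LR)
    # Cosine decay; "decayed by a factor of 10x" read as a floor of max_lr / 10.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    min_lr = max_lr / 10
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Whether the 1,000,000 decay steps include the warmup is not spelled out in the quoted text; the sketch decays over the remaining steps after warmup.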
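The cascading fine-tuning subsets described under Dataset Splits (1000 episodes drawn from 2000 preselected, then nested subsets of 100, 10, 5, 3, and 1 episode, repeated 3 times) follow a simple pattern that can be sketched as below. The function name `cascading_subsets` and the integer stand-ins for episodes are illustrative, not from the paper.

```python
import random

def cascading_subsets(episodes, sizes=(1000, 100, 10, 5, 3, 1), seed=0):
    """Draw a chain of nested random subsets; each is sampled from the previous."""
    rng = random.Random(seed)
    pool = list(episodes)
    subsets = []
    for size in sizes:
        pool = rng.sample(pool, size)  # each subset is drawn from the one before it
        subsets.append(list(pool))
    return subsets

# Repeat the procedure 3 times, as in the paper, to get 3 series per task.
series = [cascading_subsets(range(2000), seed=s) for s in range(3)]
```

Sampling each level from the previous one guarantees the subsets are nested, so a model fine-tuned on 5 episodes sees a strict subset of the 10-episode data.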