RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Authors: Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Manon Devin, Alex X. Lee, Maria Bauza Villalonga, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, Antoine Laurens, Claudio Fantacci, Valentin Dalibard, Martina Zambelli, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Misha Denil, Nathan Batchelor, Thomas Lampe, Emilio Parisotto, Konrad Zolna, Scott Reed, Sergio Gómez Colmenarejo, Jonathan Scholz, Abbas Abdolmaleki, Oliver Groth, Jean-Baptiste Regli, Oleg Sushkov, Thomas Rothörl, Jose Enrique Chen, Yusuf Aytar, David Barker, Joy Ortiz, Martin Riedmiller, Jost Tobias Springenberg, Raia Hadsell, Francesco Nori, Nicolas Heess
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. |
| Researcher Affiliation | Industry | All authors are affiliated with Google DeepMind (*Equal contributions; Equal senior contributions). Feedback email addresses are provided in the paper (redacted here as EMAIL). |
| Pseudocode | No | The paper describes the model architecture and training process in Section 2, but does not include any clearly labeled pseudocode or algorithm blocks. It uses mathematical formulations and descriptive text. |
| Open Source Code | No | The paper mentions a third-party open-source library, MoMa, with a GitHub link (https://github.com/deepmind/dm_robotics/tree/main/py/moma) in Section C.1.3 footnote 5. However, it does not provide an explicit statement or link for the source code of the RoboCat methodology described in the paper. It also states for a robot hand that 'Details of this robot hand will be released in the near future.' (Section 3.1). |
| Open Datasets | Yes | We train our VQ-GAN encoder on a diverse collection of images...Specifically, the encoder is trained on a dataset that consists of images from ImageNet (Deng et al., 2009), images from the control tasks in Reed et al. (2022) including Atari and MuJoCo (Todorov et al., 2012) locomotion tasks, as well as images from our visual robotic manipulation dataset. We use a subset of the YCB object set (Calli et al., 2017), namely the fruit (apple, banana, peach, lemon, strawberry), shown in Figure 3(c). The YCB-i vegetables (carrot, cucumber, pepper, potato) and bowl, also shown in Figure 3(c), are inspired by, but not part of, the official YCB benchmark. This collection of textured and geometrically different objects introduces additional visual diversity and allows us to benchmark RoboCat on tasks with everyday objects. |
| Dataset Splits | Yes | The full RoboCat agent is trained on 240 tasks and fine-tuned on a further 13 tasks, for a total of 253 tasks. This includes data from 2 simulated and 3 real-world embodiments, 5 simulated and 11 real task families, and 123 simulated and 134 real objects. Table 1 summarises the tasks, organised separately for training and fine-tuning tasks. For each of the simulated and real tasks, we evaluate each model by averaging over 100 episodes (or more, if specified), using a different goal image for each episode as well as randomised initial states of the environment. When fine-tuning a generalist to a specific real-world task...we first evaluate the checkpoint every 5000 steps for 25 episodes each to assess the best performing checkpoint, and then evaluate that checkpoint for 100 episodes to measure the final performance. The training and held-out tasks are listed in Figure 7(a). |
| Hardware Specification | No | The paper describes the robotic hardware used (e.g., 36 real robots: 15 Panda, 17 Sawyer, and 4 KUKA arms; Robotiq 2F-85 gripper, Robotiq FT 300 force-torque sensor, Basler dart cameras) which are part of the experimental setup, but it does not specify the computing hardware (e.g., GPU models, CPU types, or memory) used to train the models themselves. |
| Software Dependencies | No | The paper mentions using specific algorithms and models like 'VQ-GAN (Esser et al., 2021)', the 'AdamW optimiser (Loshchilov and Hutter, 2017)', and the 'MoMa library' (Section C.1 footnote 5) but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, TensorFlow versions) that would enable replication. |
| Experiment Setup | Yes | For training all RoboCat models we use the AdamW optimiser (Loshchilov and Hutter, 2017) with a linear warm-up and cosine schedule decay. The linear warm-up lasts for 15 000 steps, starting and ending at model-dependent minimum and maximum learning rates (see Table 13). This learning rate is then cosine decayed by a factor of 10 over 2 000 000 steps. The AdamW optimiser has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 256 and a sequence length of 1024 tokens for all models. We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1. For fine-tuning we use the Adam optimiser (Kingma and Ba, 2015) with a constant learning rate of 1e-5. The Adam optimiser has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 32 and a sequence length of 1024 tokens for all models. We train for up to 50 000 gradient steps. As regularisation, we use dropout (Srivastava et al., 2014) with a rate of 0.1. |
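The learning-rate schedule quoted in the Experiment Setup row (linear warm-up over 15 000 steps, then cosine decay by a factor of 10 over 2 000 000 steps) can be sketched as a plain function of the step count. This is a hypothetical reconstruction, not the authors' code: the function name `robocat_lr` and the assumption that the cosine decay window starts at the end of warm-up are ours, and the per-model `min_lr`/`max_lr` values come from the paper's Table 13 (not reproduced here, so the values below are placeholders).

```python
import math

def robocat_lr(step, min_lr, max_lr, warmup_steps=15_000, decay_steps=2_000_000):
    """Sketch of the reported schedule: linear warm-up from min_lr to
    max_lr over `warmup_steps`, then cosine decay of max_lr by a factor
    of 10 over `decay_steps` (assumed to start after warm-up)."""
    if step < warmup_steps:
        # Linear warm-up between the per-model minimum and maximum rates.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay from max_lr down to max_lr / 10, then held constant.
    t = min((step - warmup_steps) / decay_steps, 1.0)
    final_lr = max_lr / 10
    return final_lr + 0.5 * (max_lr - final_lr) * (1 + math.cos(math.pi * t))

# Placeholder rates for illustration only; the real values are model-specific.
print(robocat_lr(0, 1e-5, 1e-4))          # warm-up start: min_lr
print(robocat_lr(15_000, 1e-5, 1e-4))     # warm-up end: max_lr
print(robocat_lr(2_015_000, 1e-5, 1e-4))  # after decay: max_lr / 10
```

Whether the 2 000 000-step decay window is measured from step 0 or from the end of warm-up is not stated in the quoted text; the sketch picks the latter.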