Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Authors: Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Anian Ruoss <EMAIL>, Tim Genewein <EMAIL>.
Pseudocode | Yes | Listing 1. The frozen part of the evaluation prompt, which contains the expert demonstration episodes and stays constant throughout an evaluation episode. In this example, we have 8 demonstration episodes with 10 steps each and RGB observations. Before feeding the prompt to the model, we replace the observation and action placeholders with the actual observations (i.e., images in this case) and action strings. Listing 2. The dynamic part of the evaluation prompt containing the evaluation trajectory. While stepping through an environment, we append this prompt to the one in Listing 1 in every evaluation step (e.g., for the 3rd step here), again replacing the observation and action placeholders with the actual observations and actions.
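The two-part prompt structure described in Listings 1 and 2 can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: the function names, the `Observation:`/`Action:` line format, and the episode headers are assumptions based on the listing captions.

```python
def build_frozen_prompt(demos):
    """Render demonstration episodes into the frozen prompt prefix.

    `demos` is a list of episodes; each episode is a list of
    (observation, action) pairs. In the real benchmark, observation
    placeholders are later replaced by text or image content.
    """
    parts = []
    for i, episode in enumerate(demos):
        parts.append(f"Episode {i}:")
        for obs, action in episode:
            parts.append(f"Observation: {obs}")
            parts.append(f"Action: {action}")
    return "\n".join(parts)


def build_eval_prompt(frozen, trajectory):
    """Append the dynamic evaluation trajectory to the frozen prefix.

    The final step's action is unknown, so the prompt ends with a bare
    'Action:' line for the model to complete.
    """
    lines = [frozen, "Evaluation episode:"]
    for obs, action in trajectory[:-1]:
        lines += [f"Observation: {obs}", f"Action: {action}"]
    lines += [f"Observation: {trajectory[-1][0]}", "Action:"]
    return "\n".join(lines)
```

Keeping the demonstration prefix frozen means only the short dynamic suffix changes each step, which lets API-side prompt caching amortize the cost of very long contexts.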
Open Source Code | Yes | We open-source our in-context imitation learning benchmark that covers the zero-, few-, and many-shot regime in a unified manner, including all expert demonstrations and evaluation code at https://github.com/google-deepmind/lm_act.
Open Datasets | Yes | Atari Phoenix: We chose Phoenix as a representative Atari task... We use the Arcade Learning Environment (Bellemare et al., 2013) version, and for expert demonstrations we use the Gato training data (Reed et al., 2022). Crossword: We create a large collection of 7×7 crosswords using the genxword crossword generator (Whitlock, 2011) and a list of 55,189 clues with the lowest difficulty rating collected by Matthew Ginsberg... DM Control Cheetah Run: ...we use the Gato training data (Reed et al., 2022) to create expert demonstrations (details in Appendix B.2).
Dataset Splits | Yes | We always evaluate 100 episodes with different initial conditions (each episode is evaluated individually) and report the average score... For each evaluation episode we uniformly subsample (without replacement) the demonstration episodes (for the frozen part of the prompt) from a precomputed pool of up to 1000 distinct demonstrations. We ensure that all evaluation episodes have initial states that differ from the demonstration episodes (except for the replay control experiments in Appendix C.2).
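The subsampling procedure quoted above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `initial_state` field and the function name are hypothetical, and the real benchmark may filter overlapping initial states differently.

```python
import random


def sample_demos(pool, num_demos, eval_initial_state, rng):
    """Uniformly subsample `num_demos` demonstrations without replacement
    from a precomputed pool, excluding any demonstration whose initial
    state matches the evaluation episode's initial state."""
    eligible = [d for d in pool if d["initial_state"] != eval_initial_state]
    # random.Random.sample draws without replacement.
    return rng.sample(eligible, num_demos)
```

Drawing a fresh subsample per evaluation episode (rather than a fixed demo set) averages out any single demonstration's influence across the 100 evaluated episodes.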
Hardware Specification | No | We perform an evaluation via closed-source APIs and thus have little control over how the data is processed and fed to the underlying models. Since models behind the APIs can be updated at any time, it is possible that our results may not be quantitatively reproducible soon after publishing this manuscript. At the time of writing, querying long context models with many tokens comes with high computational cost, but it would be interesting to investigate how specialized models that were specifically developed for long-context tasks (Bulatov et al., 2022; Cherepanov et al., 2024) would fare on our benchmark.
Software Dependencies | Yes | We evaluate models against the weakest-possible version of Stockfish 16 (Romstad et al., 2008)... We create a large collection of 7×7 crosswords using the genxword crossword generator (Whitlock, 2011)... all of which we generate with the python-chess library (Fiekas, 2012).
Experiment Setup | Yes | We use temperature 0 for all models (except for o1-mini, o1-preview, and o1, which have a fixed temperature of 1 (OpenAI, 2024d)). We set the maximum (output) sample length to 2048 tokens for all models (except for o1-mini, o1-preview, and o1)... so we use a maximum (output) sample length of 8192 tokens as a good compromise between cost and performance... We post-process the model outputs by removing all leading/trailing white space and only considering the text after the keyword Action: ... For each model and task, we first ablate whether to use chain-of-thought prompting and whether to show the legal actions in the prompt (results in Appendix C.4).
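The output post-processing described above (strip whitespace, keep only the text after the `Action:` keyword) can be sketched as follows. The function name is hypothetical and the benchmark's actual parsing may differ in details such as which occurrence of the keyword is used.

```python
def extract_action(model_output):
    """Strip leading/trailing whitespace and return the text after the
    last 'Action:' keyword, so chain-of-thought reasoning before the
    action is discarded. Falls back to the full output if no keyword
    is present."""
    text = model_output.strip()
    keyword = "Action:"
    idx = text.rfind(keyword)
    if idx == -1:
        return text  # no keyword found; use the whole (stripped) output
    return text[idx + len(keyword):].strip()
```

Taking the last occurrence (rather than the first) matters under chain-of-thought prompting, where the model's reasoning may itself quote demonstration steps containing `Action:` lines.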