Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web

Authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we introduce a new benchmark, called CompWoB: 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show less generalization gap, dropping from 85.4% to 54.8%.
Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Yutaka Matsuo (2), Aleksandra Faust (1), Izzeddin Gur (1); 1: Google DeepMind, 2: The University of Tokyo
Pseudocode | Yes | To clarify their methodological difference, we provide the pseudocode in Appendix B. Moreover, we resolve the sub-optimal performance of transferred LMAs (Section 4.4) with data-rebalanced finetuning (Section 4.5).
Open Source Code | Yes | We develop CompWoB (https://github.com/google-research/google-research/tree/master/compositional_rl), simulated web environments for LMAs to measure the generalization to realistic task compositionality and complex instructions.
Open Datasets | Yes | We first design a new controlled test bed, called CompWoB, with 50 compositional tasks by combining a set of base tasks based on their difficulty (Figure 1, right). CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left).
Dataset Splits | Yes | CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left). Each compositional task is implemented from 2 to 8 base tasks in a single-page or multi-page environment, with instructions linked together using simple connectors such as "and then". Providing only the knowledge about base tasks, we investigate the generalization performance of existing SoTA prompted LMAs (Kim et al., 2023; Sun et al., 2023; Zheng et al., 2023) with planning, self-improvement, program synthesis, and structured prompts that are supported by gpt-3.5-turbo and gpt-4. Our findings indicate that their performance drops significantly, from 94.0% success on base tasks to 24.9% success on compositional tasks. In contrast, small-scale LMAs finetuned only on base tasks and zero-shot-transferred to compositional settings (i.e. transferred LMAs) deal with unknown task compositionality better, achieving 54.8% success rate on average. By rebalancing the data distribution, we train a new model, HTML-T5++, that achieves human-level performance on MiniWoB and performs the best among all the LMAs on compositional tasks. We further point out that LMAs struggle to handle complex instruction compositions permuting the order of sub-instructions, where prompted agents are more robust to the difference in the order of compositions compared to transferred agents (6.9% vs 23.8% drop in performance). Finally, we illustrate that instruction length and observation complexity are useful indicators of compositional task performance.
We hold out 21K episodes (5% of 347K + 77K = 424K) as a validation split, and after convergence, we choose the top-5 checkpoints that achieve the highest offline validation accuracy, run those checkpoints online on MiniWoB benchmarks, and then report the best success rate.
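The split arithmetic and checkpoint-selection protocol quoted above can be sketched as follows; the function names and the checkpoint record format are illustrative, not from the paper's released code.

```python
def validation_split_size(total_episodes: int, fraction: float = 0.05) -> int:
    """Number of episodes held out for offline validation (5% in the paper)."""
    return round(total_episodes * fraction)

def top_k_checkpoints(checkpoints, k: int = 5):
    """Pick the k checkpoints with the highest offline validation accuracy;
    the paper then runs these online on MiniWoB and reports the best
    success rate among them."""
    return sorted(checkpoints, key=lambda c: c["val_acc"], reverse=True)[:k]

# 5% of 347K + 77K = 424K episodes is ~21K, as quoted above.
print(validation_split_size(347_000 + 77_000))  # 21200
```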
Hardware Specification | Yes | We have used cloud TPU-v3, which has a 32 GiB HBM memory space, with 128 cores. Each finetuning experiment takes about 1-2 days.
Software Dependencies | Yes | We used the OpenAI API to call LLM inference in our experiments. Table 6 shows the API used for each method. We did most of our experiments from 2023/07 to 2023/09. We use the official implementations and prompts released by the authors. We spent about $3.6K for the experiments in total.

Models | API | Cost (input/output; $/1K tokens) | Context Length
RCI (Kim et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
RCI (Kim et al., 2023) | gpt-4 | $0.03 / $0.06 | 8K tokens
AdaPlanner (Sun et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
AdaPlanner (Sun et al., 2023) | text-davinci-003 | $0.02 / $0.02 | 4K tokens
Synapse (Zheng et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
Synapse (Zheng et al., 2023) | gpt-4 | $0.03 / $0.06 | 8K tokens

Table 6: List of LLM APIs used in this paper. We finetune HTML-T5-XL (Gur et al., 2023), a pre-trained language model with local and global attention in the encoder and a mixture of long-span denoising, on these rebalanced datasets.
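The per-1K-token prices in Table 6 make the cost accounting straightforward to reproduce. A minimal sketch, assuming the prices as listed; the token counts in the example are made up:

```python
# (input, output) USD per 1K tokens, as listed in Table 6.
PRICES = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
    "text-davinci-003": (0.02, 0.02),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the listed per-1K-token prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# e.g. one gpt-4 call with a 2K-token prompt and a 500-token completion:
print(round(call_cost("gpt-4", 2000, 500), 2))  # 0.09
```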
Experiment Setup | Yes | We then balance the number of episodes based on the task difficulty, where we gradually reduce the ratio of easier tasks to focus more on challenging tasks. For instance, we remove X% of episodes from the top-k tasks in the easy group. We heuristically design the following data-mixing strategies:
- Removing 50% of episodes from the top-10 easy tasks (424K - 73K = 351K episodes)
- Removing 80% of episodes from the top-10 easy tasks (424K - 102K = 322K episodes)
- Removing 50% of episodes from all easy tasks (424K - 142K = 282K episodes)
- Removing 80% of episodes from the top-15 easy tasks and 50% of episodes from the other 11 easy tasks (424K - 183K = 241K episodes)
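Each strategy above drops a fixed fraction of episodes for the k easiest tasks while keeping the rest intact. A hedged sketch of that operation; the task names, episode counts, and difficulty ranking here are placeholders, not the paper's actual data:

```python
import random

def rebalance(episodes_by_task, easy_tasks_ranked, top_k, drop_frac, seed=0):
    """Drop `drop_frac` of the episodes for the `top_k` easiest tasks,
    keeping all episodes for the remaining tasks."""
    rng = random.Random(seed)
    out = {}
    for task, eps in episodes_by_task.items():
        if task in easy_tasks_ranked[:top_k]:
            keep = len(eps) - int(len(eps) * drop_frac)
            out[task] = rng.sample(eps, keep)  # uniform subsample
        else:
            out[task] = list(eps)
    return out

data = {"click-button": list(range(100)), "book-flight": list(range(100))}
out = rebalance(data, easy_tasks_ranked=["click-button"], top_k=1, drop_frac=0.5)
print(len(out["click-button"]), len(out["book-flight"]))  # 50 100
```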