Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce a new benchmark, called CompWoB: 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. |
| Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Yutaka Matsuo (2), Aleksandra Faust (1), Izzeddin Gur (1); 1: Google DeepMind, 2: The University of Tokyo |
| Pseudocode | Yes | To clarify their methodological difference, we provide the pseudocode in Appendix B. Moreover, we resolve the sub-optimal performance of transferred LMAs (Section 4.4) with data-rebalanced finetuning (Section 4.5). |
| Open Source Code | Yes | We develop CompWoB, simulated web environments for LMAs to measure generalization to realistic task compositionality and complex instructions. https://github.com/google-research/google-research/tree/master/compositional_rl |
| Open Datasets | Yes | We first design a new controlled test bed, called CompWoB, with 50 compositional tasks built by combining a set of base tasks based on their difficulty (Figure 1, right). CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left). |
| Dataset Splits | Yes | CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left). Each compositional task is implemented from 2 to 8 base tasks in a single-page or multi-page environment, with instructions linked together using simple connectors such as "and then". Providing only the knowledge about base tasks, we investigate the generalization performance of existing SoTA prompted LMAs (Kim et al., 2023; Sun et al., 2023; Zheng et al., 2023) with planning, self-improvement, program synthesis, and structured prompts, supported by gpt-3.5-turbo and gpt-4. Our findings indicate that their performance drops significantly, from 94.0% success on base tasks to 24.9% success on compositional tasks. In contrast, small-scale LMAs finetuned only on base tasks and zero-shot-transferred to compositional settings (i.e. transferred LMAs) deal with unknown task compositionality better, achieving a 54.8% success rate on average. By rebalancing the data distribution, we train a new model, HTML-T5++, that achieves human-level performance on MiniWoB and performs the best among all the LMAs on compositional tasks. We further point out that LMAs struggle to handle complex instruction compositions permuting the order of sub-instructions, where prompted agents are more robust to the difference in the order of compositions compared to transferred agents (6.9% vs. 23.8% drop in performance). Finally, we illustrate that instruction length and observation complexity are useful indicators of compositional task performance. We hold out 21K episodes (5% of 347K + 77K = 424K) as a validation split, and after convergence, we choose the top-5 checkpoints that achieve the highest offline validation accuracy, run those checkpoints online on the MiniWoB benchmarks, and then report the best success rate. |
| Hardware Specification | Yes | We used a Cloud TPU v3, which has 32 GiB of HBM memory, with 128 cores. Each finetuning experiment takes about 1-2 days. |
| Software Dependencies | Yes | We used the OpenAI API to call LLM inference in our experiments. Table 6 shows the API used for each method. We did most of our experiments from 2023/07 to 2023/09. We use the official implementations and prompts released by the authors. We spent about $3.6K for the experiments in total. Table 6 (LLM APIs used; cost is per 1K input/output tokens): RCI (Kim et al., 2023): gpt-3.5-turbo ($0.0015 / $0.002, 4K-token context), gpt-4 ($0.03 / $0.06, 8K-token context); AdaPlanner (Sun et al., 2023): gpt-3.5-turbo ($0.0015 / $0.002, 4K), text-davinci-003 ($0.02 / $0.02, 4K); Synapse (Zheng et al., 2023): gpt-3.5-turbo ($0.0015 / $0.002, 4K), gpt-4 ($0.03 / $0.06, 8K). We finetune HTML-T5-XL (Gur et al., 2023), a pre-trained language model with local and global attention in the encoder and a mixture of long-span denoising, on these rebalanced datasets. |
| Experiment Setup | Yes | We then balance the number of episodes based on task difficulty, gradually reducing the ratio of easier tasks to focus more on challenging tasks. For instance, we remove X% of episodes from the top-k tasks in the easy group. We heuristically design the following data-mixing strategies: removing 50% of episodes from the top-10 easy tasks (424K - 73K = 351K episodes); removing 80% of episodes from the top-10 easy tasks (424K - 102K = 322K episodes); removing 50% of episodes from all easy tasks (424K - 142K = 282K episodes); removing 80% of episodes from the top-15 easy tasks and 50% of episodes from the other 11 easy tasks (424K - 183K = 241K episodes). |
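The episode arithmetic quoted in the Experiment Setup row can be sanity-checked with a short script. This is only a sketch of the bookkeeping: the per-strategy removal counts (73K, 102K, 142K, 183K) are taken from the quoted text, not recomputed from per-task episode data, and the strategy labels are paraphrases.

```python
# Sanity-check the episode counts for the four data-rebalancing
# strategies described in the paper (all counts in thousands).
TOTAL = 347 + 77  # 424K episodes before rebalancing

# Removal counts (in K) as quoted for each data-mixing strategy.
strategies = {
    "50% from top-10 easy tasks": 73,
    "80% from top-10 easy tasks": 102,
    "50% from all easy tasks": 142,
    "80% from top-15 / 50% from other 11 easy tasks": 183,
}

for name, removed in strategies.items():
    remaining = TOTAL - removed
    print(f"{name}: {TOTAL}K - {removed}K = {remaining}K episodes")
```

Running it reproduces the four remaining-episode totals (351K, 322K, 282K, 241K), confirming the quoted figures are internally consistent with the 424K starting pool.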