Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web

Authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we introduce a new benchmark, called CompWoB: 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show less generalization gap, dropping from 85.4% to 54.8%.
Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Yutaka Matsuo (2), Aleksandra Faust (1), Izzeddin Gur (1); 1: Google DeepMind, 2: The University of Tokyo
Pseudocode | Yes | To clarify their methodological difference, we provide the pseudocode in Appendix B. Moreover, we resolve the sub-optimal performance of transferred LMAs (Section 4.4) with data-rebalanced finetuning (Section 4.5).
Open Source Code | Yes | We develop CompWoB (https://github.com/google-research/google-research/tree/master/compositional_rl), simulated web environments for LMAs to measure the generalization to realistic task compositionality and complex instructions.
Open Datasets | Yes | We first design a new controlled test bed, called CompWoB, with 50 compositional tasks by combining a set of base tasks based on their difficulty (Figure 1, right). CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left).
Dataset Splits | Yes | CompWoB works as a hold-out test environment accompanying MiniWoB as a train environment (Figure 1, left). Each compositional task is implemented from 2 to 8 base tasks in a single-page or multi-page environment, with instructions linked together using simple connectors such as "and then". Providing only the knowledge about base tasks, we investigate the generalization performance of existing SoTA prompted LMAs (Kim et al., 2023; Sun et al., 2023; Zheng et al., 2023) with planning, self-improvement, program synthesis, and structured prompts that are supported by gpt-3.5-turbo and gpt-4. Our findings indicate that their performance drops significantly, from 94.0% success on base tasks to 24.9% success on compositional tasks. In contrast, small-scale LMAs finetuned only on base tasks and zero-shot-transferred to compositional settings (i.e. transferred LMAs) deal with unknown task compositionality better, achieving 54.8% success rate on average. By rebalancing the data distribution, we train a new model, HTML-T5++, that achieves human-level performance on MiniWoB and performs the best among all the LMAs on compositional tasks. We further point out that LMAs struggle to handle complex instruction compositions permuting the order of sub-instructions, where prompted agents are more robust to the difference in the order of compositions compared to transferred agents (6.9% vs 23.8% drop in performance). Finally, we illustrate that instruction length and observation complexity are useful indicators of compositional task performance.
We hold out 21K episodes (5% of 347K + 77K = 424K) as a validation split, and after convergence, we choose the top-5 checkpoints that achieve the highest offline validation accuracy, run those checkpoints online on MiniWoB benchmarks, and then report the best success rate.
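The split arithmetic and checkpoint-selection protocol quoted above can be sketched as follows; the function names and the checkpoint record format are illustrative, not from the paper's released code.

```python
def validation_split_size(total_episodes: int, fraction: float = 0.05) -> int:
    """Number of episodes held out for offline validation (5% in the paper)."""
    return round(total_episodes * fraction)

def top_k_checkpoints(checkpoints, k: int = 5):
    """Pick the k checkpoints with the highest offline validation accuracy;
    the paper then runs these online on MiniWoB and reports the best
    success rate among them."""
    return sorted(checkpoints, key=lambda c: c["val_acc"], reverse=True)[:k]

# 5% of 347K + 77K = 424K episodes is ~21K, as quoted above.
print(validation_split_size(347_000 + 77_000))  # 21200
```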
Hardware Specification | Yes | We have used cloud TPU-v3, which has a 32 GiB HBM memory space, with 128 cores. Each finetuning experiment takes about 1-2 days.
Software Dependencies | Yes | We used the OpenAI API to call LLM inference in our experiments. Table 6 shows the API used for each method. We did most of our experiments from 2023/07 to 2023/09. We use the official implementations and prompts released by the authors. We spent about $3.6K for the experiments in total.

Models | API | Cost (input/output; $/1K tokens) | Context Length
RCI (Kim et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
RCI (Kim et al., 2023) | gpt-4 | $0.03 / $0.06 | 8K tokens
AdaPlanner (Sun et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
AdaPlanner (Sun et al., 2023) | text-davinci-003 | $0.02 / $0.02 | 4K tokens
Synapse (Zheng et al., 2023) | gpt-3.5-turbo | $0.0015 / $0.002 | 4K tokens
Synapse (Zheng et al., 2023) | gpt-4 | $0.03 / $0.06 | 8K tokens

Table 6: List of LLM APIs used in this paper. We finetune HTML-T5-XL (Gur et al., 2023), a pre-trained language model with local and global attention in the encoder and a mixture of long-span denoising, on these rebalanced datasets.
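The per-1K-token prices in Table 6 make the cost accounting straightforward to reproduce. A minimal sketch, assuming the prices as listed; the token counts in the example are made up:

```python
# (input, output) USD per 1K tokens, as listed in Table 6.
PRICES = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
    "text-davinci-003": (0.02, 0.02),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the listed per-1K-token prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# e.g. one gpt-4 call with a 2K-token prompt and a 500-token completion:
print(round(call_cost("gpt-4", 2000, 500), 2))  # 0.09
```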
Experiment Setup | Yes | We then balance the number of episodes based on the task difficulty, where we gradually reduce the ratio of easier tasks to focus more on challenging tasks. For instance, we remove X% of episodes from the top-k tasks in the easy group. We heuristically design the following data-mixing strategies:
- Removing 50% of episodes from the top-10 easy tasks (424K - 73K = 351K episodes)
- Removing 80% of episodes from the top-10 easy tasks (424K - 102K = 322K episodes)
- Removing 50% of episodes from all easy tasks (424K - 142K = 282K episodes)
- Removing 80% of episodes from the top-15 easy tasks and 50% of episodes from the other 11 easy tasks (424K - 183K = 241K episodes)
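Each strategy above drops a fixed fraction of episodes for the k easiest tasks while keeping the rest intact. A hedged sketch of that operation; the task names, episode counts, and difficulty ranking here are placeholders, not the paper's actual data:

```python
import random

def rebalance(episodes_by_task, easy_tasks_ranked, top_k, drop_frac, seed=0):
    """Drop `drop_frac` of the episodes for the `top_k` easiest tasks,
    keeping all episodes for the remaining tasks."""
    rng = random.Random(seed)
    out = {}
    for task, eps in episodes_by_task.items():
        if task in easy_tasks_ranked[:top_k]:
            keep = len(eps) - int(len(eps) * drop_frac)
            out[task] = rng.sample(eps, keep)  # uniform subsample
        else:
            out[task] = list(eps)
    return out

data = {"click-button": list(range(100)), "book-flight": list(range(100))}
out = rebalance(data, easy_tasks_ranked=["click-button"], top_k=1, drop_frac=0.5)
print(len(out["click-button"]), len(out["book-flight"]))  # 50 100
```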