WEPO: Web Element Preference Optimization for LLM-based Web Navigation
Authors: Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns user high-level intent with output actions more effectively. The results show that our method achieved the state-of-the-art, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our experiments on multiple mainstream open-sourced models demonstrate that our WEPO significantly outperforms traditional supervised fine-tuning (SFT) methods, exceeding the MindAct (Deng et al. 2024) baseline by 20.0% and WebAgent (Gur et al. 2023) by 13.8%. |
| Researcher Affiliation | Academia | State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications |
| Pseudocode | Yes | We demonstrate the pseudo-code for WEPO implementation in Algorithm 1, which illustrates how aw, al, and x used for optimizing are obtained at each training step. Algorithm 1: WEPO Algorithm |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the methodology described, nor does it provide a direct link to a code repository. It mentions using 'mainstream open-sourced models' and links to third-party models (e.g., 'https://llama.meta.com/llama3/') but not its own implementation. |
| Open Datasets | Yes | For experiments, we selected the Mind2Web dataset due to its high task diversity and realistic web scenarios, which best validate the capabilities of fine-tuned LLM agents. Benchmarks in web navigation have evolved rapidly from the simplified MiniWoB (Shi et al. 2017) to the advanced Mind2Web (Deng et al. 2024) and other alternatives (Lù, Kasner, and Reddy 2024; Zhou et al. 2023; He et al. 2024). |
| Dataset Splits | Yes | We thoroughly evaluate WEPO on the partitioned three-tier held-out test sets in Mind2Web (Deng et al. 2024), including cross-domain, cross-website and cross-task datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory amounts, or cloud instances used for running the experiments. It only discusses the models and techniques used without specifying the underlying hardware. |
| Software Dependencies | No | The paper mentions specific LLM models (e.g., Llama-3-8B, Mistral-7B-Instruct-v0.1, Gemma-2B) but does not provide version numbers for ancillary software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | For the hyperparameters of WEPO, we set the deviation parameter β to 0.95 and the negative sample ratio to 1:3. All models were configured with a maximum context length of 8192 tokens. The learning rate is set at 0.0001, and we use a combination of learning rate warmup and a cosine decay strategy for training. We set the number of negative samples as n, corresponding to a positive-to-negative sample ratio of 1 : n. We similarly select a pruning ratio of k = 50 to maintain consistency for comparison, preserving 50 central elements and their neighboring elements tagged with element ID. |
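The reported setup (deviation parameter β = 0.95, positive-to-negative sample ratio 1:3) implies a DPO-style pairwise objective between the preferred action and each sampled negative. A minimal sketch of that objective is below; the paper does not publish its implementation, so the function names, the (log-prob policy, log-prob reference) tuple convention, and averaging the loss over the n negatives are all assumptions, not the authors' code.

```python
import math

def dpo_loss(logp_w_policy, logp_w_ref, logp_l_policy, logp_l_ref, beta=0.95):
    """Standard DPO loss for one (preferred a_w, rejected a_l) pair.

    beta=0.95 matches the deviation parameter reported in the paper.
    Inputs are sequence log-probabilities under the policy and the
    frozen reference model.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # -log(sigmoid(margin)): small when the policy prefers a_w over a_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def wepo_step_loss(pos, negatives, beta=0.95):
    """Combine pairwise losses over the n sampled negatives (ratio 1:n).

    `pos` and each entry of `negatives` are (logp_policy, logp_ref) tuples.
    Averaging over negatives is an assumed aggregation, not confirmed
    by the paper.
    """
    losses = [dpo_loss(pos[0], pos[1], lp, lr, beta) for lp, lr in negatives]
    return sum(losses) / len(losses)
```

With the 1:3 ratio from the setup, `negatives` would hold three tuples per training step; the loss shrinks as the policy's margin for the preferred action over each negative grows.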