WEPO: Web Element Preference Optimization for LLM-based Web Navigation

Authors: Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns high-level user intent with output actions more effectively. The results show that our method achieves state-of-the-art performance, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our experiments on multiple mainstream open-sourced models demonstrate that WEPO significantly outperforms traditional supervised fine-tuning (SFT) methods, exceeding the MindAct (Deng et al. 2024) baseline by 20.0% and WebAgent (Gur et al. 2023) by 13.8%.
Researcher Affiliation | Academia | State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Pseudocode | Yes | We demonstrate the pseudo-code for the WEPO implementation in Algorithm 1, which illustrates how a_w, a_l, and x used for optimization are obtained at each training step. Algorithm 1: WEPO Algorithm
Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the methodology described, nor does it provide a direct link to a code repository. It mentions using 'mainstream open-sourced models' and links to third-party models (e.g., 'https://llama.meta.com/llama3/') but not its own implementation.
Open Datasets | Yes | For experiments, we selected the Mind2Web dataset due to its high task diversity and realistic web scenarios, which best validate the capabilities of fine-tuned LLM agents. Benchmarks in web navigation have evolved rapidly from the simplified MiniWoB (Shi et al. 2017) to the advanced Mind2Web (Deng et al. 2024) and other alternatives (Lù, Kasner, and Reddy 2024; Zhou et al. 2023; He et al. 2024).
Dataset Splits | Yes | We thoroughly evaluate WEPO on the partitioned three-tier held-out test sets in Mind2Web (Deng et al. 2024), including cross-domain, cross-website, and cross-task datasets.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory amounts, or cloud instances used for running the experiments. It only discusses the models and techniques used without specifying the underlying hardware.
Software Dependencies | No | The paper mentions specific LLM models (e.g., Llama-3-8B, Mistral-7B-Instruct-v0.1, Gemma-2B) but does not provide version numbers for ancillary software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed to replicate the experiment.
Experiment Setup | Yes | For the hyperparameters of WEPO, we set the deviation parameter β to 0.95 and the negative sample ratio to 1:3. All models were configured with a maximum context length of 8192 tokens. The learning rate is set at 0.0001, and we use a combination of learning rate warmup and a cosine decay strategy for training. We set the number of negative samples as n, corresponding to a positive-to-negative sample ratio of 1:n. We similarly select a pruning ratio of k = 50 to maintain consistency for comparison, preserving 50 central elements and their neighboring elements tagged with element ID.
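The Pseudocode row above notes that WEPO optimizes over a preferred action a_w and a dispreferred action a_l given context x at each step. The paper's Algorithm 1 is not reproduced here; the following is a minimal sketch assuming WEPO uses a DPO-style pairwise objective with the reported deviation parameter β = 0.95 (the function name `dpo_loss` and the `logp_*` arguments are illustrative, not from the paper).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.95):
    """DPO-style pairwise preference loss (illustrative sketch).

    logp_w / logp_l: summed token log-probabilities of the preferred
    action a_w and dispreferred action a_l under the policy model.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference (SFT) model. beta is the deviation parameter.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy favors a_w over a_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a 1:n positive-to-negative ratio, this pairwise loss would be averaged over the n (a_w, a_l) pairs sampled for each context x.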
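The Experiment Setup row reports β = 0.95, a 1:3 negative sample ratio, an 8192-token context, a learning rate of 0.0001 with warmup followed by cosine decay, and a pruning ratio k = 50. A hedged sketch of that configuration and schedule (the `warmup_steps` and `total_steps` values are assumptions; the paper does not report them):

```python
import math

# Configuration mirroring the reported hyperparameters; warmup_steps and
# total_steps are assumed values, not taken from the paper.
CONFIG = {
    "beta": 0.95,          # deviation parameter of the preference loss
    "neg_ratio": 3,        # 1 positive : 3 negative samples
    "max_context": 8192,   # maximum context length in tokens
    "lr": 1e-4,            # peak learning rate
    "warmup_steps": 100,   # assumed
    "total_steps": 1000,   # assumed
    "prune_k": 50,         # central DOM elements preserved after pruning
}

def lr_at(step, cfg=CONFIG):
    """Linear warmup to the peak lr, then cosine decay to zero."""
    if step < cfg["warmup_steps"]:
        return cfg["lr"] * step / cfg["warmup_steps"]
    progress = (step - cfg["warmup_steps"]) / (cfg["total_steps"] - cfg["warmup_steps"])
    return cfg["lr"] * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this schedule is what libraries such as PyTorch provide via a warmup wrapper around cosine annealing; the closed form above just makes the reported "warmup + cosine decay" combination concrete.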