WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Authors: Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, Yuxiao Dong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We apply WEBRL to transform Llama-3.1 models into proficient web agents, achieving remarkable results on the WebArena-Lite benchmark. Our Llama-3.1-8B agent improves from an initial 4.8% success rate to 42.4%, while the Llama-3.1-70B agent achieves a 47.3% success rate across five diverse websites. These results surpass the performance of GPT-4-Turbo (17.6%) by over 160% in relative terms and significantly outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WEBRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
Researcher Affiliation Collaboration 1Tsinghua University 2Zhipu AI 3Shanghai Qi Zhi Institute
Pseudocode Yes The overall training process can be found in Algorithm 1. ...
Algorithm 1 WEBRL
Input: WebArena-Lite training set D0 and corresponding instruction set I0, base model Mbase, replay buffer B, failure set F, reward model M_ORM, number of phases N, number of instructions per phase K
Part 1. SFT Training
1: M_SFT ← supervised fine-tuning of model Mbase on dataset D0
2: Drollout ← rollout trajectories of M_SFT on I0
3: Dsuccess, Dfail ← evaluate Drollout using the WebArena-Lite reward function
4: B ← D0  # initialize replay buffer with successful trajectories
5: F ← Dfail  # initialize failure set with instructions of failing trajectories
Part 2. Self-evolving Curriculum RL
6: π1 ← M_SFT  # initialize actor/policy
7: V1 ← M_SFT with a randomly initialized value head  # initialize critic
8: for n in 1...N do
9:   In ← {}
10:  while size(In) < K do
11:    Igeneration ← instructions generated by GPT-4o with instructions from F as examples  # instruction generation process
12:    Ifilter ← filter(Igeneration)  # instruction filtering process
13:    In ← In ∪ Ifilter
14:  end while
15:  Drollout ← rollout trajectories of πn on In
16:  Dsuccess, Dfail ← evaluate Drollout with M_ORM  # use M_ORM to label the rollout trajectories
17:  Dexperience ← experiences from B whose perplexity under πn lies between 1/0.95 and 1/0.5
18:  πn+1, Vn+1 ← train(πn, Vn, Drollout ∪ Dexperience)  # use loss functions from Eq. 4 and Eq. 7
19:  B ← B ∪ Dsuccess
20:  F ← F ∪ Dfail
21: end for
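The experience-selection step of Algorithm 1 (line 17) keeps only past replay-buffer trajectories whose perplexity under the current policy falls between 1/0.95 and 1/0.5, i.e. tasks that are neither already mastered nor far beyond the policy. A minimal sketch of that filter, assuming each buffer entry carries a pre-computed mean token log-probability (the helper names and toy values below are illustrative, not from the paper's code):

```python
import math

def perplexity(avg_log_prob: float) -> float:
    """Perplexity from a trajectory's mean token log-probability under the policy."""
    return math.exp(-avg_log_prob)

def filter_experience(buffer, lo=1 / 0.95, hi=1 / 0.5):
    """Keep trajectories whose perplexity lies in [lo, hi].

    buffer: list of (trajectory, avg_log_prob) pairs scored by the current policy.
    """
    return [traj for traj, lp in buffer if lo <= perplexity(lp) <= hi]

# Toy usage with hypothetical mean log-probs:
# t1 is nearly memorized (perplexity ~1.01, below 1/0.95), t3 is too hard
# (perplexity ~3.32, above 1/0.5), so only t2 survives the filter.
buffer = [("t1", -0.01), ("t2", -0.3), ("t3", -1.2)]
print(filter_experience(buffer))  # → ['t2']
```

The band [1/0.95, 1/0.5] acts as a curriculum signal: discarding near-zero-perplexity replays avoids overfitting to solved tasks, while discarding high-perplexity ones avoids replaying trajectories the current policy can no longer imitate.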
Open Source Code Yes Code, model, and data will be available at https://github.com/THUDM/WebRL.
Open Datasets Yes The effectiveness of our method and the baseline models is evaluated using the WebArena environment (Zhou et al., 2024a). WebArena is particularly well-suited to our needs, as it provides a highly interactive platform that supports online learning. Additionally, WebArena encompasses a variety of websites, including OpenStreetMap (Map), Reddit, GitLab, an online store content management system (CMS), and OneStopShop (OSS), making it an ideal benchmark for comprehensively assessing model performance on web tasks. In the original WebArena environment, a total of 812 instructions are provided. Considering the cost of testing, we use 165 test cases from WebArena-Lite (Liu et al., 2024c) for evaluation.
Dataset Splits Yes Considering the cost of testing, we use 165 test cases from WebArena-Lite (Liu et al., 2024c) for evaluation. ... WebArena-Lite (Liu et al., 2024c) provides training samples along with a corresponding reward function. ... First, we perform supervised fine-tuning using the WebArena-Lite training dataset.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments. It focuses on the methodology and results without detailing the computational infrastructure.
Software Dependencies No The paper mentions using "Llama-3.1 models" but does not specify versions of any underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or other dependencies with version numbers required for reproduction.
Experiment Setup Yes The hyperparameters employed in WEBRL are presented in Table 5. ... Table 5: The hyperparameters we employ in WEBRL and baselines. SFT: learning rate 1e-5; lr scheduler type cosine; warmup ratio 0.1; batch size 128; training epochs 1; cutoff length 16384. ... [and similar details for other methods]
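The SFT hyperparameters reported in Table 5 can be collected into a single training configuration; the sketch below uses the values from the paper, but the key names are illustrative (chosen to resemble common trainer arguments, not taken from the authors' released code):

```python
# Hedged sketch: SFT hyperparameters from Table 5 of the WebRL paper,
# expressed as a config dict. Key names are hypothetical; values are as reported.
sft_config = {
    "learning_rate": 1e-5,          # SFT learning rate
    "lr_scheduler_type": "cosine",  # cosine decay schedule
    "warmup_ratio": 0.1,            # fraction of steps used for LR warmup
    "batch_size": 128,              # effective training batch size
    "num_train_epochs": 1,          # a single pass over the SFT data
    "cutoff_len": 16384,            # max sequence length in tokens
}

print(sorted(sft_config))  # list the configured hyperparameter names
```

Reproducing the setup would still require the unspecified details noted above (hardware and library versions), so this dict captures only what the paper states explicitly.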