VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Authors: Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks... We find that the best model achieves a 13.3% success rate on factual retention tasks... Our results show that video/image-capable agents are still limited and far from human levels of performance, highlighting a considerable gap in the information retrieval and agentic abilities of current state-of-the-art long-context models.
Researcher Affiliation | Collaboration | Lawrence Jang (1,3), Yinheng Li (3), Dan Zhao (2,3), Charles Ding (1), Justin Lin (1), Paul Pu Liang (2), Rogerio Bonatti (3), Kazuhito Koishida (3); (1) Carnegie Mellon University, (2) Massachusetts Institute of Technology, (3) Microsoft
Pseudocode | No | The paper describes action types in Table 4 and agent prompts in Appendix E, but it does not include structured pseudocode or a clearly labeled algorithm block for the described methodology.
Open Source Code | Yes | Our code will be open-sourced and available on GitHub. An anonymized version of our code base can be found at https://anonymous.4open.science/r/videowebarena-236E/README.md. Our videos will also be made available through Google Drive and YouTube. Link to code: https://github.com/ljang0/videowebarena/
Open Datasets | Yes | We introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content... We provide our videos online through a YouTube channel and a Google Drive link containing the zip file of all the videos. ... Link to video: https://www.youtube.com/@webarenawarrior or https://drive.google.com/file/d/17DwmsM7KzBWyz1BN1aq7NHDvgcTIrCgx/view?usp=drive_link
Dataset Splits | No | The paper introduces VideoWebArena as a benchmark for evaluating agents and categorizes tasks by type and difficulty. However, it does not describe training, validation, or test dataset splits; the tasks are used to evaluate pre-existing models rather than to train a model.
Hardware Specification | No | The paper evaluates existing models like GPT-4o and Gemini 1.5 Pro but does not provide specific details about the hardware used to conduct these evaluations.
Software Dependencies | No | The paper mentions using OpenAI's Whisper for audio transcription and Playwright Python code, but it does not provide specific version numbers for these or any other software dependencies crucial for replication.
Experiment Setup | Yes | We evaluate our benchmark using three different types of baseline agents with multimodal models as a backbone. Each type is distinguished by the type of video input provided to the model/agent. At each step, the agent is given the task objective, 2 in-context examples, the current state s, and the input video to the objective as context to generate one action... We sample 1 frame per second (max 60 frames) from the video and include them in the context for the LLM. In addition, we use OpenAI's Whisper (Radford et al., 2022) to transcribe the audio and append it to the context... The specific prompts we use are in Appendix E.
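The frame-sampling rule quoted above (1 frame per second, capped at 60 frames) can be sketched as a small helper. The paper does not specify how the 60-frame cap is applied to videos longer than 60 seconds; this sketch assumes the frames are then spread evenly across the whole video. The function name `sample_frame_indices` is hypothetical and not taken from the released code.

```python
def sample_frame_indices(total_frames: int, fps: float, max_frames: int = 60) -> list[int]:
    """Return indices of frames to pass to the model: roughly one per
    second of video, capped at max_frames.

    total_frames: number of frames in the video
    fps: the video's frame rate
    """
    # One candidate frame at the start of each whole second.
    one_per_second = list(range(0, total_frames, max(1, round(fps))))
    if len(one_per_second) <= max_frames:
        return one_per_second
    # Video longer than max_frames seconds (assumption): spread
    # max_frames indices evenly from the first to the last frame.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```

For a 30-second clip at 30 fps this yields 30 indices (one per second); for a 2-minute clip it falls back to 60 evenly spaced indices covering the full duration.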