VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Authors: Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks... We find that the best model achieves a 13.3% success rate on factual retention tasks... Our results show that video/image-capable agents are still limited and far from human levels of performance, highlighting a considerable gap in the information retrieval and agentic abilities of current state-of-the-art long-context models.
Researcher Affiliation | Collaboration | Lawrence Jang (1,3), Yinheng Li (3), Dan Zhao (2,3), Charles Ding (1), Justin Lin (1), Paul Pu Liang (2), Rogerio Bonatti (3), Kazuhito Koishida (3); (1) Carnegie Mellon University, (2) Massachusetts Institute of Technology, (3) Microsoft
Pseudocode | No | The paper describes action types in Table 4 and agent prompts in Appendix E, but it does not include structured pseudocode or a clearly labeled algorithm block for the described methodology.
Open Source Code | Yes | Our code will be open-sourced and available on GitHub. An anonymized version of our code base can be found at https://anonymous.4open.science/r/videowebarena-236E/README.md. Our videos will also be made available through Google Drive and YouTube. Link to code: https://github.com/ljang0/videowebarena/
Open Datasets | Yes | We introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content... We provide our videos online through a YouTube channel and a Google Drive link containing the zip file of all the videos. ... Link to video: https://www.youtube.com/@webarenawarrior or https://drive.google.com/file/d/17DwmsM7KzBWyz1BN1aq7NHDvgcTIrCgx/view?usp=drive_link
Dataset Splits | No | The paper introduces VideoWebArena as a benchmark for evaluating agents and categorizes tasks by type and difficulty. However, it does not describe training, validation, or test dataset splits; the tasks are used to evaluate pre-existing models rather than to train a model.
Hardware Specification | No | The paper evaluates existing models like GPT-4o and Gemini 1.5 Pro but does not provide specific details about the hardware used to conduct these evaluations.
Software Dependencies | No | The paper mentions using OpenAI's Whisper for audio transcription and Playwright Python code, but it does not provide specific version numbers for these or any other software dependencies crucial for replication.
Experiment Setup | Yes | We evaluate our benchmark using three different types of baseline agents with multimodal models as a backbone. Each type is distinguished by the type of video input provided to the model/agent. At each step, the agent is given the task objective, 2 in-context examples, the current state s, and the input video to the objective as context to generate one action... We sample 1 frame per second (max 60 frames) from the video and include them in the context for the LLM. In addition, we use OpenAI's Whisper (Radford et al., 2022) to transcribe the audio and append it to the context... The specific prompts we use are in Appendix E.
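The frame-sampling rule quoted above (1 frame per second, capped at 60 frames) can be sketched as a small helper. The paper does not specify how the 60-frame cap is applied to videos longer than 60 seconds; this sketch assumes the frames are then spread evenly across the whole video. The function name `sample_frame_indices` is hypothetical and not taken from the released code.

```python
def sample_frame_indices(total_frames: int, fps: float, max_frames: int = 60) -> list[int]:
    """Return indices of frames to pass to the model: roughly one per
    second of video, capped at max_frames.

    total_frames: number of frames in the video
    fps: the video's frame rate
    """
    # One candidate frame at the start of each whole second.
    one_per_second = list(range(0, total_frames, max(1, round(fps))))
    if len(one_per_second) <= max_frames:
        return one_per_second
    # Video longer than max_frames seconds (assumption): spread
    # max_frames indices evenly from the first to the last frame.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```

For a 30-second clip at 30 fps this yields 30 indices (one per second); for a 2-minute clip it falls back to 60 evenly spaced indices covering the full duration.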