SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Authors: John Yang, Carlos E Jimenez, Alex Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik Narasimhan, Diyi Yang, Sida Wang, Ofir Press

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We discover that existing systems perform significantly worse on SWE-bench M than they do on SWE-bench, due in large part to the challenges of visual problem solving and JavaScript's diverse development practices. Performance on SWE-bench M varies with the types of images and challenges presented: different visual elements, such as code snippets, website screenshots, and diagrams, require distinct comprehension abilities. Furthermore, JavaScript's support for object-oriented, functional, and procedural programming introduces substantial variance in how codebases are structured, which standardized solutions struggle with. We compare the performance of each baseline system in Table 3.
Researcher Affiliation | Collaboration | 1Stanford University, 2Princeton Language & Intelligence, Princeton University, 3Cornell University, 4Tübingen AI Center, University of Tübingen, 5Meta AI
Pseudocode | No | The paper describes methods and processes (e.g., the collection process, adaptation of systems) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Data, code, and leaderboard at swebench.com/multimodal.
Open Datasets | Yes | Therefore, we propose SWE-bench Multimodal (SWE-bench M) to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Data, code, and leaderboard at swebench.com/multimodal.
Dataset Splits | Yes | As shown in Figure 2, the test split consists of 517 task instances from 12 repositories; the development split contains 102 task instances from 5 repositories.
Hardware Specification | No | The paper mentions that experiments were conducted on 'Princeton University's servers' but does not provide specific details on the CPU, GPU, or other hardware used.
Software Dependencies | Yes | We focus all of our evaluations on GPT-4o (gpt-4o-2024-08-06) (OpenAI, 2024) and Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) (Anthropic, 2024), the two most well-supported multimodal LMs for long-context RAG and agent systems.
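The quoted passage pins the exact model snapshots evaluated. A minimal sketch of how those identifiers might be recorded in an evaluation config follows; only the two model ID strings come from the paper, and the surrounding keys ("provider", "supports_vision") are illustrative assumptions.

```python
# Model snapshots quoted in the paper; all other keys are illustrative,
# not part of the paper's own configuration format.
EVALUATED_MODELS = {
    "gpt-4o": {
        "model_id": "gpt-4o-2024-08-06",
        "provider": "openai",
        "supports_vision": True,
    },
    "claude-3.5-sonnet": {
        "model_id": "claude-3-5-sonnet-20240620",
        "provider": "anthropic",
        "supports_vision": True,
    },
}
```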
Experiment Setup | Yes | For the final configuration evaluated for each system, we perform a small grid search over two options for the number of past observations to show, {5, 9}. We run each configuration 5 times on the development set and select the configuration with the best mean performance. We report the results of this grid search in Table 15. For RAG systems, the amount of context we provide, in terms of either the number of documents or the absolute length of the retrieved context, is an important hyperparameter that may affect a model's performance. As in Jimenez et al. (2024a), we determine the final RAG system to evaluate by performing a grid search over three possible context lengths, {32K, 64K, 100K}, and the inclusion or not of images as input alongside the problem statement.
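The selection procedure quoted above (run every configuration five times on the development set, keep the one with the best mean score) can be sketched generically. This is a hedged illustration, not the paper's code: `evaluate(config, seed)` is a hypothetical stand-in for one dev-set run, while the two option grids mirror the sweeps the paper describes.

```python
import itertools
import statistics

def grid_search(evaluate, options, n_runs=5):
    """Run every configuration in the grid n_runs times and return the
    (config, mean_score) pair with the best mean. `evaluate` is a
    hypothetical callable standing in for one dev-set evaluation run."""
    best_config, best_mean = None, float("-inf")
    for values in itertools.product(*options.values()):
        config = dict(zip(options.keys(), values))
        mean = statistics.mean(evaluate(config, seed) for seed in range(n_runs))
        if mean > best_mean:
            best_config, best_mean = config, mean
    return best_config, best_mean

# Agent sweep from the paper: number of past observations to show.
agent_grid = {"past_observations": [5, 9]}
# RAG sweep from the paper: retrieved-context length x image inclusion.
rag_grid = {"context_length": [32_000, 64_000, 100_000],
            "include_images": [True, False]}
```

The agent grid has 2 configurations and the RAG grid 3 × 2 = 6, so at 5 runs each the procedure costs 10 and 30 dev-set runs respectively.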