Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Authors: Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible. ... We evaluate Zoomer on a suite of diverse benchmarks, including Vstar (Wu & Xie, 2023), CVBench (Tong et al., 2024a), and RealWorldQA (xAI, 2024) ... We conducted a comprehensive ablation study along two primary dimensions: (1) the role of multi-scale visual emphasis strategies, and (2) the impact of different RoI localization models. The results are summarized in Table 5. |
| Researcher Affiliation | Collaboration | "Anonymous authors. Paper under double-blind review." The paper does not provide explicit author affiliations, as it is under double-blind review. |
| Pseudocode | Yes | Algorithm 1 Multi-Scale Emphasizing Algorithm Algorithm 2 NMS-based Slice Filtering |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions open-source models as a basis for comparison or for future work, but not for the implementation of Zoomer itself. |
| Open Datasets | Yes | We evaluate Zoomer on a suite of diverse benchmarks, including Vstar (Wu & Xie, 2023), CVBench (Tong et al., 2024a), and RealWorldQA (xAI, 2024), which collectively capture controlled, open-domain, and in-the-wild reasoning scenarios. ... The experiments were conducted on a variety of different public datasets, including: 1) Vstar (Wu & Xie, 2023): A benchmark dataset focused on image classification, used to evaluate fine-grained visual recognition capabilities in object detection and classification tasks. ... 5) ScienceQA (Lu et al., 2022): A multimodal scientific question-answering dataset featuring multiple-choice questions across a diverse range of science topics. 6) MMMU (Yue et al., 2024): The validation part of a new benchmark... 7) HR (Wang et al., 2024): A high-resolution multimodal benchmark consisting of 4K and 8K images and corresponding questions. |
| Dataset Splits | Yes | To quantify the impact of this trade-off, we conducted a series of pilot experiments using GPT-4o-0513 on the Vstar-Bench dataset, which requires precise visual grounding of small or occluded objects in complex scenes. ... The threshold 0.35 is chosen empirically on the VSTAR validation set to balance accuracy and token efficiency. ... We observe that T_A = 0.35 achieves the best trade-off between accuracy and token efficiency: smaller thresholds append the global view too often, increasing token usage with marginal accuracy gains, while larger thresholds skip the global view in cases that benefit from additional context. |
| Hardware Specification | No | The paper mentions using commercial APIs (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) and open-source models (Qwen2-VL-7B, InternVL2.5-8B), but it does not specify the hardware (e.g., specific GPUs, CPUs, or memory) used by the authors for their experiments or for running these models. |
| Software Dependencies | No | To enhance the extraction of semantically relevant tokens, we apply advanced natural language processing (NLP) techniques. First, we use the NLTK library to remove stopwords... In our experiments, we primarily employ Grounding DINO (Liu et al., b) as our localization model. ... We employed three black-box MLLMs (GPT-4o-0513, Claude-v3-Sonnet, and Gemini Pro), accessed via their respective APIs (OpenAI, Claude, Google). The paper mentions software tools and APIs but does not provide version numbers for any of the libraries or environments used, except API model identifiers such as GPT-4o-0513, which are service versions rather than versioned local software dependencies. |
| Experiment Setup | Yes | Across all experiments, we set the temperature to 0 and used greedy decoding for consistency, optimizing the stability of outputs. NMS was applied with a confidence score threshold of 0.8 to filter irrelevant regions from high-resolution images. ... The threshold 0.35 is chosen empirically on the VSTAR validation set to balance accuracy and token efficiency. |
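The Pseudocode and Experiment Setup rows mention an "NMS-based Slice Filtering" algorithm with a confidence-score threshold of 0.8 for discarding irrelevant regions of high-resolution images. A minimal sketch of what such a step could look like, assuming standard greedy NMS over scored RoI boxes (the IoU threshold of 0.5 and the function names are assumptions; the paper's Algorithm 2 may differ in detail):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_slices(boxes, scores, score_thresh=0.8, iou_thresh=0.5):
    """Drop low-confidence slice candidates, then greedily suppress
    slices that overlap an already-kept, higher-scoring slice.
    Returns the indices of the kept slices."""
    kept = []
    # Consider only candidates above the confidence threshold,
    # highest-scoring first (standard greedy NMS ordering).
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept
```

For example, two heavily overlapping boxes scored 0.9 and 0.85 would collapse to the higher-scoring one, while a candidate scored below 0.8 is removed before NMS runs at all.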
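The Dataset Splits row describes a threshold of 0.35 that controls whether the downscaled global view is appended alongside the RoI slices: smaller thresholds append it too often (more tokens), larger ones skip it when global context would have helped. A hedged sketch of that decision, where `context_score` is a hypothetical per-query measure of how much the question depends on scene-level context (the excerpt does not specify the exact criterion):

```python
def assemble_views(slices, context_score, t_a=0.35):
    """Return the list of image views to send to the black-box MLLM.

    `slices` are the cropped RoI views already selected; when the
    (hypothetical) context_score exceeds the threshold T_A, the
    downscaled global view is appended so the model keeps
    scene-level context at the cost of extra image tokens.
    """
    views = list(slices)
    if context_score > t_a:
        views.append("global_view")  # placeholder for the downscaled full image
    return views
```

Lowering `t_a` makes the condition fire more often, matching the quoted observation that smaller thresholds "append the global view too often, increasing token usage with marginal accuracy gains".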