Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Authors: Kevin Li, Sachin Goyal, João D Semedo, Zico Kolter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first characterize this optimal tradeoff between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs is achieved by using the largest LLM that fits within the inference budget while minimizing visual token count, often to a single token. ... We fit the proposed scaling law (Eq. 2) on {Y(N, T), N, T} pairs, with N ∈ {0.5, 1.8, 4, 7}B and T ∈ {1, 4, 16, 36, 64, 144, 576}. We use grid-search, for its stability (Goyal et al., 2024b), to estimate the scaling parameters α, β, A, B, and D. The final scaling law is evaluated on an N = 14B VLM model at various T visual tokens. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Bosch Center for Artificial Intelligence |
| Pseudocode | No | The paper describes algorithms in text (e.g., in Appendix B: "The following section details our updates to the existing token compression algorithm...") and provides a summarizing figure (Figure 7), but it does not present any structured pseudocode or algorithm blocks with numbered steps. |
| Open Source Code | No | The paper mentions using existing frameworks like LLaVA-Next and Qwen-1.5, but it does not contain an explicit statement about releasing the source code for the methodology described in this paper, nor does it provide a link to a code repository. |
| Open Datasets | Yes | To estimate the downstream error Y (N, C), we test our trained VLMs on a suite of nine commonly used benchmarks for evaluating visual reasoning: MME (Fu et al., 2024), GQA (Hudson & Manning, 2019), AI2D (Kembhavi et al., 2016), MMBench (Liu et al., 2024c), MMMU (Yue et al., 2023), Science QA (Lu et al., 2022), Math Vista (Lu et al., 2024), POPE (Li et al., 2023c), and Chart QA (Masry et al., 2022). |
| Dataset Splits | Yes | The pretraining and finetuning dataset and hyperparameters follow Liu et al. (2024a), except we doubled the effective batch size for finetuning. |
| Hardware Specification | No | The paper mentions "Bosch's compute" in the acknowledgements, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks like LLaVA-Next, Qwen-1.5, Vicuna 7B, and CLIP-ViT-L/14, but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) that are necessary for reproducibility. |
| Experiment Setup | Yes | VLM Training and Evaluation: We use the LLaVA-Next framework (Liu et al., 2024b) to train VLMs with the Qwen-1.5 family of language models as the backbone. Specifically, we utilize the {0.5, 1.8, 4, 7, 14}B-chat models (Bai et al., 2023). The pretraining and finetuning dataset and hyperparameters follow Liu et al. (2024a), except we doubled the effective batch size for finetuning. ... Fitting Scaling Laws: We fit the proposed scaling law (Eq. 2) on {Y(N, T), N, T} pairs, with N ∈ {0.5, 1.8, 4, 7}B and T ∈ {1, 4, 16, 36, 64, 144, 576}. We use grid-search, for its stability (Goyal et al., 2024b), to estimate the scaling parameters α, β, A, B, and D. ... The grid-search range for each of the parameters was as follows: α, β ∈ [0, 0.1]; A, B, D ∈ [0, 1]. |
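The grid-search fitting procedure quoted above can be sketched in a few lines. Note the paper does not reproduce Eq. 2 in this report, so the functional form below (additive power laws in LLM parameter count N and visual token count T, plus an offset D) is an assumption for illustration; `scaling_law` and `grid_search_fit` are hypothetical helper names, and the search ranges match those reported in the table (α, β ∈ [0, 0.1]; A, B, D ∈ [0, 1]).

```python
import itertools
import numpy as np

def scaling_law(N, T, alpha, beta, A, B, D):
    # Assumed form of the scaling law (Eq. 2 in the paper; the exact
    # parameterization is not quoted here): error decays as a power
    # law in both N (LLM parameters) and T (visual tokens).
    return A * N ** (-alpha) + B * T ** (-beta) + D

def grid_search_fit(Y, N, T, steps=5):
    # Exhaustive grid search, as the paper prefers it over gradient
    # fitting for stability. Ranges follow the quoted setup:
    # alpha, beta in [0, 0.1]; A, B, D in [0, 1].
    alphas = np.linspace(0.0, 0.1, steps)
    betas = np.linspace(0.0, 0.1, steps)
    coeffs = np.linspace(0.0, 1.0, steps)
    best_params, best_err = None, float("inf")
    for a, b, A, B, D in itertools.product(
        alphas, betas, coeffs, coeffs, coeffs
    ):
        pred = scaling_law(N, T, a, b, A, B, D)
        err = float(np.mean((pred - Y) ** 2))  # mean squared error
        if err < best_err:
            best_params, best_err = (a, b, A, B, D), err
    return best_params, best_err

# The (N, T) grid from the paper: 4 LLM sizes x 7 token counts.
N_grid = np.repeat(np.array([0.5e9, 1.8e9, 4e9, 7e9]), 7)
T_grid = np.tile(np.array([1, 4, 16, 36, 64, 144, 576], dtype=float), 4)
```

A fit over this grid (28 points, 5^5 = 3125 parameter combinations) runs in well under a second; the fitted law would then be checked by extrapolating to the held-out N = 14B model, as the paper describes.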