Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Authors: Kevin Li, Sachin Goyal, João D Semedo, Zico Kolter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first characterize this optimal tradeoff between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs is achieved by using the largest LLM that fits within the inference budget while minimizing visual token count, often to a single token. ... We fit the proposed scaling law (Eq. 2) on {Y(N, T), N, T} pairs, with N ∈ {0.5, 1.8, 4, 7}B and T ∈ {1, 4, 16, 36, 64, 144, 576}. We use grid-search, for its stability (Goyal et al., 2024b), to estimate the scaling parameters α, β, A, B, and D. The final scaling law is evaluated on an N = 14B VLM model at various T visual tokens. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Bosch Center for Artificial Intelligence |
| Pseudocode | No | The paper describes algorithms in text (e.g., in Appendix B: "The following section details our updates to the existing token compression algorithm...") and provides a summarizing figure (Figure 7), but it does not present any structured pseudocode or algorithm blocks with numbered steps. |
| Open Source Code | No | The paper mentions using existing frameworks like LLaVA-Next and Qwen-1.5, but it does not contain an explicit statement about releasing the source code for the methodology described in this paper, nor does it provide a link to a code repository. |
| Open Datasets | Yes | To estimate the downstream error Y (N, C), we test our trained VLMs on a suite of nine commonly used benchmarks for evaluating visual reasoning: MME (Fu et al., 2024), GQA (Hudson & Manning, 2019), AI2D (Kembhavi et al., 2016), MMBench (Liu et al., 2024c), MMMU (Yue et al., 2023), Science QA (Lu et al., 2022), Math Vista (Lu et al., 2024), POPE (Li et al., 2023c), and Chart QA (Masry et al., 2022). |
| Dataset Splits | Yes | The pretraining and finetuning dataset and hyperparameters follow Liu et al. (2024a), except we doubled the effective batch size for finetuning. |
| Hardware Specification | No | The paper mentions "Bosch's compute" in the acknowledgements, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks like LLaVA-Next, Qwen-1.5, Vicuna 7B, and CLIP-ViT-L/14, but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) that are necessary for reproducibility. |
| Experiment Setup | Yes | VLM Training and Evaluation: We use the LLaVA-Next framework (Liu et al., 2024b) to train VLMs with the Qwen-1.5 family of language models as the backbone. Specifically, we utilize the {0.5, 1.8, 4, 7, 14}B-chat models (Bai et al., 2023). The pretraining and finetuning dataset and hyperparameters follow Liu et al. (2024a), except we doubled the effective batch size for finetuning. ... Fitting Scaling Laws: We fit the proposed scaling law (Eq. 2) on {Y(N, T), N, T} pairs, with N ∈ {0.5, 1.8, 4, 7}B and T ∈ {1, 4, 16, 36, 64, 144, 576}. We use grid-search, for its stability (Goyal et al., 2024b), to estimate the scaling parameters α, β, A, B, and D. ... The grid-search range for each of the parameters was as follows: α, β ∈ [0, 0.1]; A, B, D ∈ [0, 1]. |
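The grid-search fitting procedure quoted above can be sketched in a few lines. Note the paper does not reproduce Eq. 2 in this report, so the functional form below (additive power laws in LLM parameter count N and visual token count T, plus an offset D) is an assumption for illustration; `scaling_law` and `grid_search_fit` are hypothetical helper names, and the search ranges match those reported in the table (α, β ∈ [0, 0.1]; A, B, D ∈ [0, 1]).

```python
import itertools
import numpy as np

def scaling_law(N, T, alpha, beta, A, B, D):
    # Assumed form of the scaling law (Eq. 2 in the paper; the exact
    # parameterization is not quoted here): error decays as a power
    # law in both N (LLM parameters) and T (visual tokens).
    return A * N ** (-alpha) + B * T ** (-beta) + D

def grid_search_fit(Y, N, T, steps=5):
    # Exhaustive grid search, as the paper prefers it over gradient
    # fitting for stability. Ranges follow the quoted setup:
    # alpha, beta in [0, 0.1]; A, B, D in [0, 1].
    alphas = np.linspace(0.0, 0.1, steps)
    betas = np.linspace(0.0, 0.1, steps)
    coeffs = np.linspace(0.0, 1.0, steps)
    best_params, best_err = None, float("inf")
    for a, b, A, B, D in itertools.product(
        alphas, betas, coeffs, coeffs, coeffs
    ):
        pred = scaling_law(N, T, a, b, A, B, D)
        err = float(np.mean((pred - Y) ** 2))  # mean squared error
        if err < best_err:
            best_params, best_err = (a, b, A, B, D), err
    return best_params, best_err

# The (N, T) grid from the paper: 4 LLM sizes x 7 token counts.
N_grid = np.repeat(np.array([0.5e9, 1.8e9, 4e9, 7e9]), 7)
T_grid = np.tile(np.array([1, 4, 16, 36, 64, 144, 576], dtype=float), 4)
```

A fit over this grid (28 points, 5^5 = 3125 parameter combinations) runs in well under a second; the fitted law would then be checked by extrapolating to the held-out N = 14B model, as the paper describes.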