Matryoshka Multimodal Models
Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first detail the experiment settings in Sec 4.1. Then we show the performance of M3 on both image-level benchmarks (Sec 4.2) and video-level benchmarks (Sec 4.3). Finally, we analyze the behavior of Matryoshka Multimodal Models and provide ablations in Sec 4.4 and 4.5. ... Table 1: Comparison between LLaVA-1.5-M3 across various benchmarks under image understanding benchmarks. ... Table 2: Comparison of approaches with the SS baseline and M3 across various benchmarks under LLaVA-NeXT (Liu et al., 2024b). |
| Researcher Affiliation | Collaboration | Mu Cai1 Jianwei Yang2 Jianfeng Gao2 Yong Jae Lee1 1University of Wisconsin-Madison 2Microsoft Research, Redmond |
| Pseudocode | No | The paper describes the methodology using mathematical equations (3.1, 3.2) and conceptual diagrams (Figure 3) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To enable further research on adaptive LMMs that learn diverse information granularities, we publicly release our code and models. ... We have publicly released our code, data, and pretrained models, so that the community can fully reproduce, and build upon, our work. ... https://matryoshka-mm.github.io/ |
| Open Datasets | Yes | MMBench (Liu et al., 2023b), GQA (Hudson & Manning, 2019), POPE (Li et al., 2023c), VizWiz (Gurari et al., 2018), SEEDBench (Li et al., 2023a), ScienceQA (Lu et al., 2022), MMMU (Yue et al., 2024), and (b) document understanding/optical character recognition (OCR) benchmarks: DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), AI2D (Kembhavi et al., 2016) and TextVQA (Singh et al., 2019). For video understanding, we use both (a) open-ended video question answering benchmarks evaluated by GPT-3.5: MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017) and ActivityNet-QA (Yu et al., 2019); and (b) multiple-choice video question answering benchmarks: NExT-QA (Xiao et al., 2021), IntentQA (Li et al., 2023b), and EgoSchema (Mangalam et al., 2024). ... MSCOCO (Lin et al., 2014) validation set |
| Dataset Splits | Yes | Figure 1: Matryoshka Multimodal Models. ... The image is from MSCOCO (Lin et al., 2014) validation set and the captions are generated given 1, 9, and 576 tokens, respectively. ... We evaluate LLaVA-1.5-M3 on the common multimodal understanding and reasoning benchmarks. ... We finetune the whole model using the exact visual instruction data from LLaVA-1.5 and LLaVA-NeXT, respectively. |
| Hardware Specification | Yes | We train both models for 1 epoch using 8 NVIDIA H100 GPUs. ... The development device is Tesla V100 GPU, and time estimated by the roofline model represents the theoretical performance that the hardware can achieve. |
| Software Dependencies | No | We use LLaVA-1.5 (Liu et al., 2024a) and LLaVA-NeXT (Liu et al., 2024b) as the base LMMs, both with Vicuna 7B as the language model backbone. ... CLIP-ViT-L-336 (Radford et al., 2021) as the visual encoder. |
| Experiment Setup | Yes | The learning rate of the LLM is 2×10⁻⁵ and 1×10⁻⁵, respectively, for LLaVA-1.5 and LLaVA-NeXT. The learning rate for the visual encoder is 2×10⁻⁵ for both models. We train both models for 1 epoch using 8 NVIDIA H100 GPUs. ... We design 5 scales for the visual tokens. LLaVA-1.5 (Liu et al., 2024a) and LLaVA-NeXT (Liu et al., 2024b) both leverage CLIP-ViT-L-336 (Radford et al., 2021) as the visual encoder, where an image is embedded into 24×24 visual tokens. We gradually apply 2×2 pooling with stride 2, resulting in 12×12, 6×6, and 3×3 visual tokens, where we finally apply a 3×3 pooling to get the final single visual token. Therefore, the sizes of the Matryoshka visual token sets are S = {1, 9, 36, 144, 576}, following a nested manner. |
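The pooling cascade described in the Experiment Setup row can be sketched as follows. This is a minimal NumPy illustration of the token-count arithmetic only, not the authors' released implementation; the feature dimension (1024, matching CLIP-ViT-L) and the function names are assumptions for the sake of the example.

```python
import numpy as np

def pool2x2(tokens):
    """2x2 average pooling with stride 2 over an (H, W, C) token grid."""
    H, W, C = tokens.shape
    return tokens.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def matryoshka_scales(tokens):
    """Build the nested visual token sets from a 24x24 grid:
    24x24 -> 12x12 -> 6x6 -> 3x3 via 2x2 pooling, then a final
    3x3 pooling collapses the 3x3 grid into a single token."""
    scales = [tokens]
    for _ in range(3):                      # 576 -> 144 -> 36 -> 9 tokens
        tokens = pool2x2(tokens)
        scales.append(tokens)
    single = tokens.mean(axis=(0, 1)).reshape(1, 1, -1)  # 9 -> 1 token
    scales.append(single)
    return scales

# Feature dim 1024 is an assumption (CLIP-ViT-L hidden size).
grid = np.random.randn(24, 24, 1024)
sizes = [s.shape[0] * s.shape[1] for s in matryoshka_scales(grid)]
print(sizes)  # [576, 144, 36, 9, 1]
```

Each coarser scale is derived deterministically from the finer one, which is what makes the token sets nested: the 9-token representation is an average of the 36-token one, and so on down to a single token.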