Matryoshka Multimodal Models
Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first detail the experiment settings in Sec 4.1. Then we show the performance of M3 on both image-level benchmarks (Sec 4.2) and video-level benchmarks (Sec 4.3). Finally, we analyze the behavior of Matryoshka Multimodal Models and provide ablations in Sec 4.4 and 4.5. ... Table 1: Comparison between LLaVA-1.5-M3 across various benchmarks under image understanding benchmarks. ... Table 2: Comparison of approaches with the SS baseline and M3 across various benchmarks under LLaVA-NeXT (Liu et al., 2024b). |
| Researcher Affiliation | Collaboration | Mu Cai1 Jianwei Yang2 Jianfeng Gao2 Yong Jae Lee1 1University of Wisconsin-Madison 2Microsoft Research, Redmond |
| Pseudocode | No | The paper describes the methodology using mathematical equations (3.1, 3.2) and conceptual diagrams (Figure 3) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To enable further research on adaptive LMMs that learn diverse information granularities, we publicly release our code and models. ... We have publicly released our code, data, and pretrained models, so that the community can fully reproduce, and build upon, our work. ... https://matryoshka-mm.github.io/ |
| Open Datasets | Yes | MMBench (Liu et al., 2023b), GQA (Hudson & Manning, 2019), POPE (Li et al., 2023c), VizWiz (Gurari et al., 2018), SEEDBench (Li et al., 2023a), ScienceQA (Lu et al., 2022), MMMU (Yue et al., 2024), and (b) document understanding/optical character recognition (OCR) benchmarks: DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), AI2D (Kembhavi et al., 2016) and TextVQA (Singh et al., 2019). For video understanding, we use both (a) open-ended video question answering benchmarks evaluated by GPT-3.5: MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017) and ActivityNet-QA (Yu et al., 2019); and (b) multiple-choice video question answering benchmarks: NExT-QA (Xiao et al., 2021), IntentQA (Li et al., 2023b), and EgoSchema (Mangalam et al., 2024). ... MSCOCO (Lin et al., 2014) validation set |
| Dataset Splits | Yes | Figure 1: Matryoshka Multimodal Models. ... The image is from MSCOCO (Lin et al., 2014) validation set and the captions are generated given 1, 9, and 576 tokens, respectively. ... We evaluate LLaVA-1.5-M3 on the common multimodal understanding and reasoning benchmarks. ... We finetune the whole model using the exact visual instruction data from LLaVA-1.5 and LLaVA-NeXT, respectively. |
| Hardware Specification | Yes | We train both models for 1 epoch using 8 NVIDIA H100 GPUs. ... The development device is Tesla V100 GPU, and time estimated by the roofline model represents the theoretical performance that the hardware can achieve. |
| Software Dependencies | No | We use LLaVA-1.5 (Liu et al., 2024a) and LLaVA-NeXT (Liu et al., 2024b) as the base LMMs, both with Vicuna 7B as the language model backbone. ... CLIP-ViT-L-336 (Radford et al., 2021) as the visual encoder. |
| Experiment Setup | Yes | The learning rate of the LLM is 2×10⁻⁵ and 1×10⁻⁵, respectively, for LLaVA-1.5 and LLaVA-NeXT. The learning rate for the visual encoder is 2×10⁻⁵ for both models. We train both models for 1 epoch using 8 NVIDIA H100 GPUs. ... We design 5 scales for the visual tokens. LLaVA-1.5 (Liu et al., 2024a) and LLaVA-NeXT (Liu et al., 2024b) both leverage CLIP-ViT-L-336 (Radford et al., 2021) as the visual encoder, where an image is embedded into 24×24 visual tokens. We gradually apply 2×2 pooling with stride 2, resulting in 12×12, 6×6, and 3×3 visual tokens, where we finally apply a 3×3 pooling to get the final single visual token. Therefore, the sizes of the Matryoshka visual token sets are S = {1, 9, 36, 144, 576}, following a nested manner. |
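The pooling cascade described in the Experiment Setup row can be sketched as follows. This is a minimal NumPy illustration of the token-count arithmetic only, not the authors' released implementation; the feature dimension (1024, matching CLIP-ViT-L) and the function names are assumptions for the sake of the example.

```python
import numpy as np

def pool2x2(tokens):
    """2x2 average pooling with stride 2 over an (H, W, C) token grid."""
    H, W, C = tokens.shape
    return tokens.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def matryoshka_scales(tokens):
    """Build the nested visual token sets from a 24x24 grid:
    24x24 -> 12x12 -> 6x6 -> 3x3 via 2x2 pooling, then a final
    3x3 pooling collapses the 3x3 grid into a single token."""
    scales = [tokens]
    for _ in range(3):                      # 576 -> 144 -> 36 -> 9 tokens
        tokens = pool2x2(tokens)
        scales.append(tokens)
    single = tokens.mean(axis=(0, 1)).reshape(1, 1, -1)  # 9 -> 1 token
    scales.append(single)
    return scales

# Feature dim 1024 is an assumption (CLIP-ViT-L hidden size).
grid = np.random.randn(24, 24, 1024)
sizes = [s.shape[0] * s.shape[1] for s in matryoshka_scales(grid)]
print(sizes)  # [576, 144, 36, 9, 1]
```

Each coarser scale is derived deterministically from the finer one, which is what makes the token sets nested: the 9-token representation is an average of the 36-token one, and so on down to a single token.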