HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Authors: Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first outline the experimental setup, including training settings and datasets. Then, we compare our HaploVL with leading methods on various benchmarks. Finally, an analysis of training procedures and some qualitative results are given at the end of this section. |
| Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²ARC Lab, Tencent PCG, ³Tsinghua University. |
| Pseudocode | Yes | Algorithm 1: Current Token Prediction Loss (PyTorch-like pseudocode):<br>`def loss(hidden_state, target_ids, embed_tokens):`<br>`    # hidden_state: [N, C]; target_ids: [N]; embed_tokens: [K, C]`<br>`    # N: sequence length; K: vocabulary size; C: hidden dimension`<br>`    logits = hidden_state @ embed_tokens.transpose(-2, -1)  # get logits, [N, K]`<br>`    logits = logit_scale * logits  # scale logits`<br>`    loss = F.cross_entropy(logits, target_ids)  # calculate loss`<br>`    return loss` |
| Open Source Code | Yes | Code is available at https://github.com/Tencent/HaploVLM. |
| Open Datasets | Yes | The data is mainly from LLaVA (Liu et al., 2024a; Li et al., 2024a), dolphin (Computations, 2023), CC3M (Changpinyo et al., 2021), and COCO (Lin et al., 2014). Evaluation data: HaploVL is evaluated on widely adopted image-based benchmarks including GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), ScienceQA-IMG (SQA) (Lu et al., 2022b), AI2D (Kembhavi et al., 2016), MMBench-EN-dev (MMB) (Liu et al., 2024d), MMMU (Yue et al., 2024), RealWorldQA (x.ai, 2024), MMStar (MMS) (Chen et al., 2024a), POPE (Li et al., 2023c), SEED-Bench-IMG (SEED) (Li et al., 2023a), and MMVP (Tong et al., 2024). |
| Dataset Splits | Yes | In terms of the data, all models are trained on 665K plus 558K multi-modal samples from LLaVA-1.5 (Liu et al., 2024a) if there is no other statement. During the fully fine-tuning stage, ... Regarding the data, our best model is optimized on the 4M visual instruction data for 1 epoch (~30K steps). For HaploVL-7B, ... we first tune the connector between the pre-decoder and post-decoder using 558K caption data and then fully tune the model using 665K instruction data. For HaploVL-8B with the ability to input any resolution, we first tune the whole model using 1.2M caption data (Chen et al., 2023) and then tune the model using 4M instruction data (Li et al., 2024a). For the models that support multi-image and video input, we continue training the single-image model using a mix of interleaved data and single-image data. For the ablation experiments, the models are optimized on the 0.6M visual instruction data for 5K steps. |
| Hardware Specification | Yes | All models are optimized using the AdamW (Loshchilov & Hutter, 2019) optimizer and cosine scheduler on 32 GPUs with 64GB per-device memory. |
| Software Dependencies | No | The paper mentions 'PyTorch-like pseudocode' for the current token prediction loss and 'AdamW' for the optimizer, but does not provide specific version numbers for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | During the pre-training stage, we optimize the post-decoder for 40K steps with a 1e-4 learning rate, a batch size of 256, and 2K warm-up steps. In terms of the data, all models are trained on 665K plus 558K multi-modal samples from LLaVA-1.5 (Liu et al., 2024a) if there is no other statement. During the fully fine-tuning stage, the learning rate is set to 2e-5 and the batch size to 128. Regarding the data, our best model is optimized on the 4M visual instruction data for 1 epoch (~30K steps). All models are optimized using the AdamW (Loshchilov & Hutter, 2019) optimizer and cosine scheduler on 32 GPUs with 64GB per-device memory. More details are recorded in the Appendix. |
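The current-token-prediction loss quoted in the Pseudocode row can be sketched as a self-contained NumPy function, with `F.cross_entropy` replaced by an explicit log-softmax plus mean negative log-likelihood (the same default mean reduction). The `logit_scale` parameter is an assumption here: the excerpt uses it without defining it, so this sketch takes it as a plain scalar defaulting to 1.0.

```python
import numpy as np

def current_token_prediction_loss(hidden_state, target_ids, embed_tokens, logit_scale=1.0):
    """Cross-entropy over logits from hidden states and the (tied) embedding matrix.

    hidden_state: [N, C] decoder outputs
    target_ids:   [N]    ground-truth token ids
    embed_tokens: [K, C] embedding matrix (vocabulary size K)
    logit_scale:  scalar temperature (assumed; the paper excerpt leaves it undefined)
    """
    # project hidden states onto the embedding matrix to get logits: [N, K]
    logits = logit_scale * (hidden_state @ embed_tokens.T)
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # mean negative log-likelihood of the target tokens
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```

This mirrors the quoted pseudocode line for line; swapping the log-softmax block back for `torch.nn.functional.cross_entropy(logits, target_ids)` recovers the original PyTorch form.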
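The pre-training schedule in the Experiment Setup row (2K linear warm-up steps, then cosine decay over 40K total steps from a 1e-4 peak learning rate) can be sketched as a small plain-Python helper. The floor learning rate of 0 is an assumption, since the excerpt does not state a minimum:

```python
import math

def lr_at_step(step, total_steps=40_000, warmup_steps=2_000, peak_lr=1e-4, min_lr=0.0):
    """Linear warm-up followed by cosine decay (values from the pre-training stage).

    min_lr is an assumed floor; the paper excerpt does not specify one.
    """
    if step < warmup_steps:
        # linear warm-up from 0 to peak_lr over the first warmup_steps
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For the fine-tuning stage, the same shape would apply with `peak_lr=2e-5` and roughly 30K total steps.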