HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Authors: Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first outline the experimental setup, including training settings and datasets. Then, we compare our HaploVL with leading methods on various benchmarks. Finally, an analysis of training procedures and some qualitative results are given at the end of this section. |
| Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²ARC Lab, Tencent PCG, ³Tsinghua University. |
| Pseudocode | Yes | Algorithm 1: Current Token Prediction Loss (PyTorch-like pseudocode):<br>`def loss(hidden_state, target_ids, embed_tokens):`<br>`    # hidden_state: [N, C]; target_ids: [N]; embed_tokens: [K, C]`<br>`    # N: sequence length; K: vocabulary size; C: hidden dimension`<br>`    logits = hidden_state @ embed_tokens.transpose(-2, -1)  # get logits, [N, K]`<br>`    logits = logit_scale * logits  # scale logits`<br>`    loss = F.cross_entropy(logits, target_ids)  # calculate loss`<br>`    return loss` |
| Open Source Code | Yes | Code is available at https://github.com/Tencent/HaploVLM. |
| Open Datasets | Yes | The data is mainly from LLaVA (Liu et al., 2024a; Li et al., 2024a), dolphin (Computations, 2023), CC3M (Changpinyo et al., 2021), and COCO (Lin et al., 2014). Evaluation data: HaploVL is evaluated on widely adopted image-based benchmarks including GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), ScienceQA-IMG (SQA) (Lu et al., 2022b), AI2D (Kembhavi et al., 2016), MMBench-EN-dev (MMB) (Liu et al., 2024d), MMMU (Yue et al., 2024), RealWorldQA (x.ai, 2024), MMStar (MMS) (Chen et al., 2024a), POPE (Li et al., 2023c), SEED-Bench-IMG (SEED) (Li et al., 2023a), and MMVP (Tong et al., 2024). |
| Dataset Splits | Yes | In terms of the data, all models are trained on 665K plus 558K multi-modal samples from LLaVA-1.5 (Liu et al., 2024a) if there is no other statement. During the fully fine-tuning stage, ... Regarding the data, our best model is optimized on the 4M visual instruction data for 1 epoch (~30K steps). For HaploVL-7B, ... we first tune the connector between the pre-decoder and post-decoder using 558K caption data and then fully tune the model using 665K instruction data. For HaploVL-8B with the ability to input any resolution, we first tune the whole model using 1.2M caption data (Chen et al., 2023) and then tune the model using 4M instruction data (Li et al., 2024a). For the models that support multi-image and video input, we continue training the single-image model using a mix of interleaved data and single-image data. For the ablation experiments, the models are optimized on the 0.6M visual instruction data for 5K steps. |
| Hardware Specification | Yes | All models are optimized using the AdamW (Loshchilov & Hutter, 2019) optimizer and cosine scheduler on 32 GPUs with 64GB per-device memory. |
| Software Dependencies | No | The paper mentions 'PyTorch-like pseudocode' for the current token prediction loss and 'AdamW' for the optimizer, but does not provide specific version numbers for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | During the pre-training stage, we optimize the post-decoder for 40K steps with a 1e-4 learning rate, a batch size of 256, and 2K warm-up steps. In terms of the data, all models are trained on 665K plus 558K multi-modal samples from LLaVA-1.5 (Liu et al., 2024a) if there is no other statement. During the fully fine-tuning stage, the learning rate is set to 2e-5 and the batch size to 128. Regarding the data, our best model is optimized on the 4M visual instruction data for 1 epoch (~30K steps). All models are optimized using the AdamW (Loshchilov & Hutter, 2019) optimizer and cosine scheduler on 32 GPUs with 64GB per-device memory. More details are recorded in the Appendix. |
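The current-token-prediction loss quoted in the Pseudocode row can be sketched as a self-contained NumPy function, with `F.cross_entropy` replaced by an explicit log-softmax plus mean negative log-likelihood (the same default mean reduction). The `logit_scale` parameter is an assumption here: the excerpt uses it without defining it, so this sketch takes it as a plain scalar defaulting to 1.0.

```python
import numpy as np

def current_token_prediction_loss(hidden_state, target_ids, embed_tokens, logit_scale=1.0):
    """Cross-entropy over logits from hidden states and the (tied) embedding matrix.

    hidden_state: [N, C] decoder outputs
    target_ids:   [N]    ground-truth token ids
    embed_tokens: [K, C] embedding matrix (vocabulary size K)
    logit_scale:  scalar temperature (assumed; the paper excerpt leaves it undefined)
    """
    # project hidden states onto the embedding matrix to get logits: [N, K]
    logits = logit_scale * (hidden_state @ embed_tokens.T)
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # mean negative log-likelihood of the target tokens
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```

This mirrors the quoted pseudocode line for line; swapping the log-softmax block back for `torch.nn.functional.cross_entropy(logits, target_ids)` recovers the original PyTorch form.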
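The pre-training schedule in the Experiment Setup row (2K linear warm-up steps, then cosine decay over 40K total steps from a 1e-4 peak learning rate) can be sketched as a small plain-Python helper. The floor learning rate of 0 is an assumption, since the excerpt does not state a minimum:

```python
import math

def lr_at_step(step, total_steps=40_000, warmup_steps=2_000, peak_lr=1e-4, min_lr=0.0):
    """Linear warm-up followed by cosine decay (values from the pre-training stage).

    min_lr is an assumed floor; the paper excerpt does not specify one.
    """
    if step < warmup_steps:
        # linear warm-up from 0 to peak_lr over the first warmup_steps
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For the fine-tuning stage, the same shape would apply with `peak_lr=2e-5` and roughly 30K total steps.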