TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Authors: Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings. [...] In this section, we first present the detailed experimental setup. We then enumerate the improvements brought by our proposed TG-LLaVA over the baseline across multiple evaluation metrics, and compare our method with several state-of-the-art (SoTA) approaches under various configurations. Specifically, we visualize the attention map to demonstrate the efficacy of the proposed TG-LLaVA. Finally, we conduct ablation studies and provide an analysis of the results. |
| Researcher Affiliation | Collaboration | 1School of Cybersecurity, Northwestern Polytechnical University 2AI Business, Alibaba Group 3College of Computer Science and Technology, Zhejiang University 4College of Information and Control Engineering, Xi'an University of Architecture and Technology 5School of Computer Science, Northwestern Polytechnical University |
| Pseudocode | No | The paper describes the method with architectural diagrams (Figure 2, 3, 4) and mathematical formulations (Eq 1-5) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/AIDC-AI |
| Open Datasets | Yes | Datasets Focusing on proposing a novel optimization method for the VLM framework, we do not incorporate any additional data beyond the LLaVA-1.5 open-source dataset (Liu et al. 2024a), which has 558K image captions for pre-training and 665K conversations for instruction tuning. We also apply our TG-LLaVA to the Mini-Gemini dataset (Reid et al. 2024), which consists of 1.2M + 1.5M data. For evaluation, we conduct extensive experiments and report results on widely-adopted VLM benchmarks using the VLMEvalKit (Duan et al. 2024) platform to provide robust and comprehensive performance validation for the proposed TG-LLaVA. The evaluation datasets include: MMBench (MMB) (Liu et al. 2023a), MMS (MMStar) (Chen et al. 2024a), MMMU (Yue et al. 2024), MV (MathVista) (Lu et al. 2023), OCRB (OCRBench) (Liu et al. 2023b), AI2D (Hiippala et al. 2021), HB (HallusionBench) (Guan et al. 2024), LB (LLaVA-Bench) (Liu et al. 2024c), SQA (ScienceQA) (Saikh et al. 2022), and MME (Fu et al. 2024). |
| Dataset Splits | No | The paper mentions using the LLaVA-1.5 open-source dataset for pre-training (558K image captions) and instruction fine-tuning (665K conversations), and the Mini-Gemini dataset (1.2M + 1.5M data). It also lists various evaluation datasets. However, it does not explicitly state the training, validation, and test splits (e.g., percentages or exact counts) for any of these datasets in the main text, nor does it cite a specific resource for these splits. |
| Hardware Specification | Yes | The training process for TG-LLaVA utilizes the PyTorch framework and employs 8 H100-80G GPUs. |
| Software Dependencies | No | The paper mentions the "PyTorch framework" but does not specify a version number or other software dependencies with specific version numbers. |
| Experiment Setup | Yes | For training configurations, we adhere strictly to the settings outlined in the original LLaVA-1.5 paper to ensure fairness, with learning rates of 1e-3 and 2e-5 for the pre-training and instruction fine-tuning phases, respectively, and maintaining batch sizes of 256 and 128. The DP module introduces 64 additional visual tokens. |
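The reported training setup can be collected into a minimal configuration sketch. Note this is an illustrative assumption for clarity: the dictionary layout and key names are not from the authors' code; only the numeric values come from the quoted setup.

```python
# Hedged sketch of the TG-LLaVA training setup quoted above.
# Structure and key names are illustrative assumptions; only the
# numeric values (learning rates, batch sizes, token count, GPUs)
# are taken from the paper's reported configuration.
TRAIN_CONFIG = {
    "pretraining": {
        "learning_rate": 1e-3,          # LLaVA-1.5 pre-training setting
        "batch_size": 256,
        "data": "558K image captions",
    },
    "instruction_tuning": {
        "learning_rate": 2e-5,          # instruction fine-tuning setting
        "batch_size": 128,
        "data": "665K conversations",
    },
    "extra_visual_tokens": 64,          # introduced by the DP module
    "hardware": "8x H100-80G GPUs",
}

def lr_for(phase: str) -> float:
    """Return the learning rate reported for a given training phase."""
    return TRAIN_CONFIG[phase]["learning_rate"]
```

Keeping both phases in one structure makes the fairness claim easy to audit: every hyperparameter matches the original LLaVA-1.5 recipe, with only the 64 DP-module tokens added on top.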