VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we introduce comprehensive experiments to evaluate our method on various visual understanding and generation tasks. Firstly, we outline our experimental setup, including the model architecture, training datasets, and evaluation benchmarks. Subsequently, we evaluate the performance of our unified foundation vision tower. Then, we compare our method with other popular VLMs on various visual understanding and generation benchmarks. Finally, we give some qualitative results. |
| Researcher Affiliation | Collaboration | Yecheng Wu (1,2), Zhuoyang Zhang (2), Junyu Chen (1,2), Haotian Tang (2), Dacheng Li (4), Yunhao Fang (5), Ligeng Zhu (3), Enze Xie (3), Hongxu Yin (3), Li Yi (1), Song Han (2,3), Yao Lu (3). Affiliations: (1) Tsinghua University, (2) MIT, (3) NVIDIA, (4) UC Berkeley, (5) UC San Diego |
| Pseudocode | No | The paper describes the methods in Section 3, including 'UNIFIED FOUNDATION VISION TOWER' and 'UNIFIED MULTI-MODAL GENERATIVE PRE-TRAINING', but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is open sourced at https://github.com/mit-han-lab/vila-u. |
| Open Datasets | Yes | We train our vision tower on COYO-700M (Byeon et al., 2022) and evaluate it for zero-shot classification and reconstruction performance on ImageNet (Deng et al., 2009b). For visual understanding, we leverage 1M [image, text] data from ShareGPT4V (Chen et al., 2023), 6M interleaved text and image data from MMC4 (Zhu et al., 2024). For visual generation, we incorporate 15M high-quality [text, image] data curated from our internal dataset and 1M [text, video] data from OpenVid (Nan et al., 2024) datasets. ... For examining visual understanding ability, we evaluate our model on the widely adopted zero-shot image-based visual-language benchmarks including VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), TextVQA (Singh et al., 2019), POPE (Li et al., 2023d), MME (Fu et al., 2024), SEED (Li et al., 2023a), MM-Vet (Yu et al., 2023b) and video-based visual-language benchmarks including ActivityNet (Caba Heilbron et al., 2015), MSVD (Chen & Dolan, 2011), MSRVTT (Xu et al., 2017), TGIF (Li et al., 2016). To evaluate the visual generation capability, we use MJHQ-30K (Li et al., 2024) and GenAI-Bench (Lin et al., 2024) for image generation and VBench (Huang et al., 2024) for video generation. |
| Dataset Splits | No | The paper mentions using various public datasets such as ShareGPT4V and MMC4 for pre-training, and evaluates on standard benchmarks such as VQAv2 and GenAI-Bench. However, it does not specify how the pre-training datasets were split into training, validation, and test sets for its own experiments, nor give percentages or counts for such splits, beyond using existing evaluation benchmarks for testing. |
| Hardware Specification | No | The paper describes the model architecture and components used (e.g., 'LLaMA-2-7B', 'SigLIP-Large-patch16-256'), but does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using specific models like LLaMA-2-7B and SigLIP, but it does not specify software dependencies such as programming language or library versions (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In our experiments, we employ LLaMA-2-7B (Touvron et al., 2023b) as our base language model. For the vision tower, we choose SigLIP-Large-patch16-256 / SigLIP-SO400M-patch14-384 (Zhai et al., 2023) as our vision encoder architecture, and adopt the residual quantizer, depth transformer as well as the decoder architecture from RQ-VAE (Lee et al., 2022). The quantizer codebook size is 16384. All images and videos are resized to a resolution of 256×256 / 384×384, with each image or video frame converted into a 16×16×4 / 27×27×16 code with the residual depth D = 4 / D = 16. We train our vision tower on COYO-700M (Byeon et al., 2022)... Classifier-free guidance (Ho & Salimans, 2022) is employed for visual generation with a CFG value of 3. ... We use a weighted sum to combine the text-image contrastive loss and the VQ-based image reconstruction loss: L_total = w_contra * L_contra + w_recon * L_recon (Eq. 1). In our experiments, we pick w_contra = 1 and w_recon = 1. |
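The setup row above mentions two mechanisms worth unpacking: the residual quantizer adopted from RQ-VAE, which encodes each feature as a depth-D stack of codebook indices by repeatedly quantizing the remaining residual, and the Eq. 1 weighted sum of the contrastive and reconstruction losses. The sketch below illustrates both in minimal NumPy; the function names and the toy codebook are hypothetical, not the paper's implementation (which uses a learned 16384-entry codebook and depth D = 4 or 16).

```python
import numpy as np

def residual_quantize(feature, codebook, depth=4):
    """Residual quantization in the style of RQ-VAE (illustrative sketch).

    At each depth, snap the current residual to its nearest codebook
    entry, accumulate that entry into the reconstruction, and carry the
    leftover residual to the next depth. Returns the depth-D code stack
    and the reconstructed feature.
    """
    residual = np.asarray(feature, dtype=float)
    recon = np.zeros_like(residual)
    codes = []
    for _ in range(depth):
        # Nearest codebook entry to the current residual (L2 distance).
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon = recon + codebook[idx]
        residual = residual - codebook[idx]
    return codes, recon

def total_loss(l_contra, l_recon, w_contra=1.0, w_recon=1.0):
    # Eq. 1: L_total = w_contra * L_contra + w_recon * L_recon
    # (the paper sets both weights to 1).
    return w_contra * l_contra + w_recon * l_recon
```

With a 16384-entry codebook and depth D = 4, each 16×16 feature grid thus becomes a 16×16×4 tensor of integer codes, which is what the language model consumes and predicts during unified pre-training.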