Personalized Visual Instruction Tuning

Authors: Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental The experiments demonstrate a substantial personalized performance enhancement after fine-tuning with our curated dataset. [...] 6 EXPERIMENTS [...] In this section, we demonstrate the effectiveness of our proposed PVIT on the constructed P-Bench. We first showcase the results of the PVIT-tuned LLaVA (Liu et al., 2023a), which demonstrates significantly higher performance than the SOTA MLLMs that support multi-image inputs. Then, we conduct an ablation study on each of the components of our constructed data to demonstrate their contributions towards the final performance. [...] 6.1 MAIN RESULTS ON P-BENCH [...] 6.2 ABLATION STUDY
Researcher Affiliation Academia Renjie Pi¹, Jianshu Zhang¹, Tianyang Han¹, Jipeng Zhang¹, Rui Pan², Tong Zhang² — ¹The Hong Kong University of Science and Technology, ²University of Illinois Urbana-Champaign
Pseudocode No The paper describes its data generation framework and methods in narrative text and through a high-level diagram in Figure 1, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code Yes Code and data are available at the following links: https://github.com/sterzhang/PVIT https://huggingface.co/datasets/Sterzhang/PVIT-3M. The code and data are released under the MIT and Apache 2.0 licenses, respectively.
Open Datasets Yes Code and data are available at the following links: https://github.com/sterzhang/PVIT https://huggingface.co/datasets/Sterzhang/PVIT-3M. The code and data are released under the MIT and Apache 2.0 licenses, respectively. [...] We would like to clarify that all images used in the construction of our dataset and benchmark were collected from publicly available datasets, including Visual Genome (Krishna et al., 2016), COCO (Lin et al., 2015), Object365 (Shao et al., 2019), and Flickr30k (Plummer et al., 2016).
Dataset Splits No The paper states, 'We train the MLLM with a subset of our PVIT-3M with 1M samples' and provides detailed statistics for the P-Bench evaluation benchmark, including sample counts for different question types and numbers of people. However, it does not specify explicit training/validation/test splits (e.g., percentages or precise sample counts for each split) for the PVIT-3M dataset, nor does it describe how P-Bench's evaluation data is partitioned in relation to the training data in a reproducible manner.
Hardware Specification Yes The entire training is conducted on 8 A100 GPUs with 80GB memory, which lasted for 30 hours.
Software Dependencies No The paper lists model choices (LLaVA-1.6-7B, InternVL2-26B, LLaMA3.1-8B-Instruct, Grounding DINO) and various hyperparameters in Table 10. While command-line arguments and model names are provided, specific versions for underlying programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch), or other key software libraries are not explicitly mentioned to enable replication of the software environment.
Experiment Setup Yes In Table 10, we illustrate the detailed hyper-parameters used when fine-tuning the MLLM with our PVIT-3M. We wish to note that we start tuning from the checkpoint of LLaVA-7B (Liu et al., 2023a). We train the MLLM with a subset of our PVIT-3M with 1M samples. The entire training is conducted on 8 A100 GPUs with 80GB memory, which lasted for 30 hours.

Table 10 hyper-parameters (parameter → value):
--lora_enable True
--lora_r 128
--lora_alpha 256
--mm_projector_lr 1e-4
--deepspeed ./scripts/zero2.json
--version v1
--vision_tower openai/clip-vit-large-patch14-336
--mm_projector_type mlp2x_gelu
--mm_vision_select_layer -2
--mm_use_im_start_end False
--mm_use_im_patch_token False
--image_aspect_ratio pad
--group_by_modality_length True
--bf16 True
--num_train_epochs 1
--per_device_train_batch_size 16
--per_device_eval_batch_size 4
--gradient_accumulation_steps 2
--evaluation_strategy no
--save_strategy steps
--save_steps 50000
--save_total_limit 1
--learning_rate 2e-4
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type cosine
--logging_steps 1
--tf32 True
--model_max_length 4096
--gradient_checkpointing True
--dataloader_num_workers 4
--lazy_preprocess True
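The flags above follow the naming convention of LLaVA's LoRA fine-tuning script, so a reproduction attempt would assemble them into a single DeepSpeed launch. The sketch below is a hedged illustration, not the authors' exact command: the training script name, base-checkpoint identifier, and data/output paths are assumptions not stated in the paper excerpt.

```shell
#!/bin/sh
# Hypothetical launch command assembling the Table 10 hyper-parameters.
# ASSUMPTIONS: the script path (llava/train/train_mem.py), the base
# checkpoint ID, and the --data_path/--image_folder/--output_dir values
# are illustrative placeholders; only the flags below them come from Table 10.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --data_path ./data/pvit_3m_subset_1m.json \
    --image_folder ./data/images \
    --output_dir ./checkpoints/pvit-llava-7b-lora \
    --lora_enable True --lora_r 128 --lora_alpha 256 \
    --mm_projector_lr 1e-4 \
    --version v1 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy no \
    --save_strategy steps --save_steps 50000 --save_total_limit 1 \
    --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True
```

With per-device batch size 16, gradient accumulation 2, and 8 GPUs, this corresponds to an effective global batch size of 256.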