Teaching Human Behavior Improves Content Understanding Abilities Of VLMs
Authors: SOMESH SINGH, Harini S I, Yaman Singla, Changyou Chen, Rajiv Ratn Shah, Veeky Baths, Balaji Krishnamurthy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content understanding tasks. We show this performance increase over 6 types of behavior, 46 different tasks covering image, video, text and audio over 26 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. In the experimental results, we aim to showcase the diverse and emergent capabilities of our Behavior-LLaVA model through quantitative numbers on various tasks and qualitative examples. |
| Researcher Affiliation | Collaboration | Adobe Media and Data Science Research (MDSR) SUNY at Buffalo, CNRL and APPCAIR at BITS Pilani |
| Pseudocode | No | The paper describes the methodology in prose (Section 2 Methodology) and provides instruction fine-tuning templates in 'Listing 1', 'Listing 6', and 'Listing 7', but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava. |
| Open Datasets | Yes | We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava. SALICON (Jiang et al., 2015) and Cheng et al. (2014) for visual saliency (10k images each), CELER (Berzak et al., 2022) and Dundee corpus (Kennedy et al., 2013) |
| Dataset Splits | No | The paper mentions evaluating on an 'eval set' (Section 2.2) and for specific benchmarks, such as HVU, that 'Performance evaluation on HVU tasks is conducted using the mean average precision (mAP) metric on the validation set' (Section 3.1). However, it does not explicitly provide the training/validation/test splits for their newly created BLIFT dataset with specific percentages or counts. |
| Hardware Specification | No | The paper lists training hyperparameters and software components in Appendix E.1 (e.g., 'Deepspeed Zero2 with Offload', 'Base Model: llama-vid-13b-full-224-video-fps-1'), but does not specify the exact hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for experimentation. |
| Software Dependencies | Yes | Appendix E.1 and E.2 list several software dependencies with their specific versions or models: 'Base Model: llama-vid-13b-full-224-video-fps-1', 'Vision Tower: LAVIS/eva_vit_g.pth', 'Image Processor: processor/clip-patch14-224', 'Scene Splitting: pyscenedetect (Breakthrough, 2023)', 'ASR: openai/whisper-large-v3', and 'Caption and Keywords: llava-v1.6-vicuna-13b'. |
| Experiment Setup | Yes | Appendix E.1 'TRAINING HYPERPARAMETERS' provides a detailed experimental setup including specific hyperparameter values such as 'Number of Training Epochs: 2.2', 'Per-Device Training Batch Size: 4', 'Learning Rate: 2e-5', 'Weight Decay: 0', 'Warmup Ratio: 0.03', 'Learning Rate Scheduler: Cosine', and 'Maximum Sequence Length: 2048', among others. |
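The reported hyperparameters can be gathered into a single config sketch. This is illustrative only: the field names and the learning-rate helper below are assumptions for readability, not the paper's actual training script, which is not reproduced in the review.

```python
import math

# Training hyperparameters as reported in Appendix E.1 of the paper,
# collected into a plain dict. Key names are illustrative.
behavior_llava_config = {
    "base_model": "llama-vid-13b-full-224-video-fps-1",
    "num_train_epochs": 2.2,
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-5,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "max_seq_length": 2048,
    "deepspeed": "zero2_with_offload",
}

def lr_at(step, total_steps, cfg=behavior_llava_config):
    """Cosine schedule with linear warmup, a standard reading of the
    'Warmup Ratio: 0.03' + 'Learning Rate Scheduler: Cosine' settings
    (this exact formulation is an assumption, not paper-verified)."""
    warmup = int(cfg["warmup_ratio"] * total_steps)
    if step < warmup:
        return cfg["learning_rate"] * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return cfg["learning_rate"] * 0.5 * (1 + math.cos(math.pi * progress))
```

With these settings, the learning rate ramps linearly to 2e-5 over the first 3% of steps, then decays to zero along a cosine curve.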