Teaching Human Behavior Improves Content Understanding Abilities Of VLMs

Authors: Somesh Singh, Harini S I, Yaman Singla, Changyou Chen, Rajiv Ratn Shah, Veeky Baths, Balaji Krishnamurthy

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content understanding tasks. We show this performance increase over 6 types of behavior, 46 different tasks covering image, video, text and audio over 26 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. In the experimental results, we aim to showcase the diverse and emergent capabilities of our Behavior-LLaVA model through quantitative numbers on various tasks and qualitative examples.
Researcher Affiliation | Collaboration | Adobe Media and Data Science Research (MDSR); SUNY at Buffalo; CNRL and APPCAIR at BITS Pilani
Pseudocode | No | The paper describes the methodology in prose (Section 2 Methodology) and provides instruction fine-tuning templates in 'Listing 1', 'Listing 6', and 'Listing 7', but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We also release BLIFT, our Behavior-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava.
Open Datasets | Yes | We also release BLIFT, our Behavior-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava. SALICON (Jiang et al., 2015) and Cheng et al. (2014) for visual saliency (10k images each), CELER (Berzak et al., 2022) and Dundee corpus (Kennedy et al., 2013)
Dataset Splits | No | The paper mentions evaluating on an 'eval set' (Section 2.2), and for specific benchmarks, such as HVU, states that 'Performance evaluation on HVU tasks is conducted using the mean average precision (mAP) metric on the validation set' (Section 3.1). However, it does not explicitly provide training/validation/test splits for the newly created BLIFT dataset with specific percentages or counts.
Hardware Specification | No | The paper lists training hyperparameters and software components in Appendix E.1 (e.g., 'Deepspeed Zero2 with Offload', 'Base Model: llama-vid-13b-full-224-video-fps-1'), but does not specify exact hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for experimentation.
Software Dependencies | Yes | Appendix E.1 and E.2 list several software dependencies with their specific versions or models: 'Base Model: llama-vid-13b-full-224-video-fps-1', 'Vision Tower: LAVIS/eva_vit_g.pth', 'Image Processor: processor/clip-patch14-224', 'Scene Splitting: pyscenedetect (Breakthrough, 2023)', 'ASR: openai/whisper-large-v3', and 'Caption and Keywords: llava-v1.6-vicuna-13b'.
Experiment Setup | Yes | Appendix E.1 'TRAINING HYPERPARAMETERS' provides a detailed experimental setup including specific hyperparameter values such as 'Number of Training Epochs: 2.2', 'Per-Device Training Batch Size: 4', 'Learning Rate: 2e-5', 'Weight Decay: 0', 'Warmup Ratio: 0.03', 'Learning Rate Scheduler: Cosine', and 'Maximum Sequence Length: 2048', among others.
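The hyperparameter values quoted above can be collected into a single machine-readable record, which is useful when comparing a reproduction run against the reported setup. A minimal sketch, assuming illustrative key names loosely modeled on Hugging Face TrainingArguments conventions (the dict below is not the authors' actual configuration file, only the values reported in Appendix E.1):

```python
# Hypothetical reproducibility sketch: training hyperparameters reported in
# Appendix E.1, gathered into one dict. Key names are illustrative, not the
# authors' real config keys; values are exactly as reported in the paper.
blift_training_config = {
    "num_train_epochs": 2.2,            # fractional epoch count, as reported
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-5,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "max_seq_length": 2048,
}

def summarize(config: dict) -> str:
    """Render the config one hyperparameter per line, for side-by-side diffs."""
    return "\n".join(f"{key}: {value}" for key, value in config.items())

print(summarize(blift_training_config))
```

A reproduction attempt could diff `summarize()` output against its own logged arguments to catch silent mismatches (e.g., an integer epoch count where the paper reports 2.2).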