Teaching Human Behavior Improves Content Understanding Abilities Of VLMs
Authors: SOMESH SINGH, Harini S I, Yaman Singla, Changyou Chen, Rajiv Ratn Shah, Veeky Baths, Balaji Krishnamurthy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content understanding tasks. We show this performance increase over 6 types of behavior, 46 different tasks covering image, video, text and audio over 26 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. In the experimental results, we aim to showcase the diverse and emergent capabilities of our Behavior-LLaVA model through quantitative numbers on various tasks and qualitative examples. |
| Researcher Affiliation | Collaboration | Adobe Media and Data Science Research (MDSR) SUNY at Buffalo, CNRL and APPCAIR at BITS Pilani |
| Pseudocode | No | The paper describes the methodology in prose (Section 2 Methodology) and provides instruction fine-tuning templates in 'Listing 1', 'Listing 6', and 'Listing 7', but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava. |
| Open Datasets | Yes | We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms on which we train our models to achieve this. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava. SALICON (Jiang et al., 2015) and Cheng et al. (2014) for visual saliency (10k images each), CELER (Berzak et al., 2022) and Dundee corpus (Kennedy et al., 2013) |
| Dataset Splits | No | The paper mentions evaluating on an 'eval set' (Section 2.2) and for specific benchmarks, such as HVU, that 'Performance evaluation on HVU tasks is conducted using the mean average precision (mAP) metric on the validation set' (Section 3.1). However, it does not explicitly provide the training/validation/test splits for their newly created BLIFT dataset with specific percentages or counts. |
| Hardware Specification | No | The paper lists training hyperparameters and software components in Appendix E.1 (e.g., 'Deepspeed Zero2 with Offload', 'Base Model: llama-vid-13b-full-224-video-fps-1'), but does not specify the exact hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for experimentation. |
| Software Dependencies | Yes | Appendix E.1 and E.2 list several software dependencies with their specific versions or models: 'Base Model: llama-vid-13b-full-224-video-fps-1', 'Vision Tower: LAVIS/eva_vit_g.pth', 'Image Processor: processor/clip-patch14-224', 'Scene Splitting: pyscenedetect (Breakthrough, 2023)', 'ASR: openai/whisper-large-v3', and 'Caption and Keywords: llava-v1.6-vicuna-13b'. |
| Experiment Setup | Yes | Appendix E.1 'TRAINING HYPERPARAMETERS' provides a detailed experimental setup including specific hyperparameter values such as 'Number of Training Epochs: 2.2', 'Per-Device Training Batch Size: 4', 'Learning Rate: 2e-5', 'Weight Decay: 0', 'Warmup Ratio: 0.03', 'Learning Rate Scheduler: Cosine', and 'Maximum Sequence Length: 2048', among others. |
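The reported hyperparameters can be gathered into a single config sketch. This is illustrative only: the field names and the learning-rate helper below are assumptions for readability, not the paper's actual training script, which is not reproduced in the review.

```python
import math

# Training hyperparameters as reported in Appendix E.1 of the paper,
# collected into a plain dict. Key names are illustrative.
behavior_llava_config = {
    "base_model": "llama-vid-13b-full-224-video-fps-1",
    "num_train_epochs": 2.2,
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-5,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "max_seq_length": 2048,
    "deepspeed": "zero2_with_offload",
}

def lr_at(step, total_steps, cfg=behavior_llava_config):
    """Cosine schedule with linear warmup, a standard reading of the
    'Warmup Ratio: 0.03' + 'Learning Rate Scheduler: Cosine' settings
    (this exact formulation is an assumption, not paper-verified)."""
    warmup = int(cfg["warmup_ratio"] * total_steps)
    if step < warmup:
        return cfg["learning_rate"] * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return cfg["learning_rate"] * 0.5 * (1 + math.cos(math.pi * progress))
```

With these settings, the learning rate ramps linearly to 2e-5 over the first 3% of steps, then decays to zero along a cosine curve.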