VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence
Authors: Hao Li, Hao Fei, Zechao Hu, Zhengwei Yang, Zheng Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. |
| Researcher Affiliation | Academia | Hao Li1, Hao Fei2, Zechao Hu1, Zhengwei Yang1, Zheng Wang1* 1 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University 2 School of Computing, National University of Singapore EMAIL, EMAIL |
| Pseudocode | No | The paper describes methodologies such as Language Guided Sampling (LGS) and Temporal Attention Module (TAM) using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/lihao921/VEGAS |
| Open Datasets | Yes | In this study, we report results on Social-IQ 2.0 and leverage various datasets and their transformations in training as Table 1 shows. For the LGS, we craft data based on TVQA (Lei et al. 2018), Next-QA (Xiao et al. 2021), and Video-ChatGPT (Maaz et al. 2023). For the STP, we use RAVDESS (Livingstone and Russo 2018), AudioCaps (Kim et al. 2019), CMU-MOSEI (Zadeh et al. 2016), and Expression in-the-Wild (ExpW) (Zhang et al. 2018). For VEGAS-generalist, we integrate TVQA and CMU-MOSEI for multimodal joint training. We incorporate expert insights distilled by ChatGPT from Social-IQ data (Zadeh et al. 2019) to provide in-depth analysis. |
| Dataset Splits | No | The paper mentions using several datasets for training and evaluation but does not specify explicit train/validation/test split details. |
| Hardware Specification | Yes | All training is conducted on 4 A100 40G GPUs with a batch size of 64. |
| Software Dependencies | No | The paper mentions specific models like the Vicuna-7b LLM, CLIP ViT-B/32, and T5-small, and uses GPT-3.5-turbo for evaluation, but it does not list general software dependencies with version numbers (e.g., Python, PyTorch). |
| Experiment Setup | Yes | Training Details. For the sampler, we set n = 32 and k = 8, and encode language hints with the text encoder from CLIP ViT-B/32 (Radford et al. 2021). We initiate the LGS from scratch and train the sampling process with a learning rate of 2e-4. The STP module is pre-trained with a learning rate of 1e-6. For the joint tuning of the STP and LLM, we set their learning rates to 2e-5 and 2e-4, respectively, and train for three epochs. The Vicuna-7b LLM (Chiang et al. 2023) is fine-tuned using Low-Rank Adaptation (LoRA) with r = 128 and α = 256. Note that the joint tuning is performed for both VEGAS-generalist in open-ended QA and VEGAS in supervised MCQ, but on different datasets. All training is conducted on 4 A100 40G GPUs with a batch size of 64. All training proceeds for one epoch except for supervised MCQ, which is trained for three epochs. |
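For reproduction purposes, the hyperparameters reported in the Experiment Setup row can be collected into a single config. This is a minimal sketch, not code from the VEGAS repository: the dictionary keys are assumed names, and only the numeric values come from the paper. It also shows the standard LoRA scaling factor α/r implied by the reported r = 128 and α = 256.

```python
# Hedged sketch: the paper's reported training hyperparameters gathered into
# a plain config dict. Key names are illustrative assumptions; the values
# (n, k, learning rates, LoRA r/alpha, batch size) are from the paper.
train_config = {
    "sampler": {"n_frames": 32, "k_selected": 8, "lr": 2e-4},  # LGS, trained from scratch
    "stp_pretrain": {"lr": 1e-6},                              # STP pre-training
    "joint_tuning": {"stp_lr": 2e-5, "llm_lr": 2e-4},          # STP + LLM joint tuning
    "lora": {"r": 128, "alpha": 256},                          # LoRA on Vicuna-7b
    "batch_size": 64,
    "hardware": "4x A100 40G",
}

# In standard LoRA, each low-rank update BA is scaled by alpha / r before
# being added to the frozen weight; these settings give a factor of 2.
lora_scaling = train_config["lora"]["alpha"] / train_config["lora"]["r"]
print(lora_scaling)  # 2.0
```

A config like this makes it easy to check that a reimplementation uses the paper's settings before launching a run; the actual repository at https://github.com/lihao921/VEGAS may organize these values differently.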