VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence
Authors: Hao Li, Hao Fei, Zechao Hu, Zhengwei Yang, Zheng Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. |
| Researcher Affiliation | Academia | Hao Li1, Hao Fei2, Zechao Hu1, Zhengwei Yang1, Zheng Wang1* 1 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University 2 School of Computing, National University of Singapore EMAIL, EMAIL |
| Pseudocode | No | The paper describes methodologies such as Language Guided Sampling (LGS) and Temporal Attention Module (TAM) using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/lihao921/VEGAS |
| Open Datasets | Yes | In this study, we report results on Social-IQ 2.0 and leverage various datasets and their transformations in training as Table 1 shows. For the LGS, we craft data based on TVQA (Lei et al. 2018), Next-QA (Xiao et al. 2021), and Video-ChatGPT (Maaz et al. 2023). For the STP, we use RAVDESS (Livingstone and Russo 2018), AudioCaps (Kim et al. 2019), CMU-MOSEI (Zadeh et al. 2016), and Expression in-the-Wild (ExpW) (Zhang et al. 2018). For VEGAS-generalist, we integrate TVQA and CMU-MOSEI for multimodal joint training. We incorporate expert insights distilled by ChatGPT from Social-IQ data (Zadeh et al. 2019) to provide in-depth analysis. |
| Dataset Splits | No | The paper mentions using several datasets for training and evaluation but does not specify explicit train/validation/test split details. |
| Hardware Specification | Yes | All training is conducted on 4 A100 40G GPUs with a batch size of 64. |
| Software Dependencies | No | The paper mentions specific models like the Vicuna-7b LLM, CLIP ViT-B/32, and T5-small, and uses GPT-3.5-turbo for evaluation, but it does not list general software dependencies with version numbers (e.g., Python, PyTorch). |
| Experiment Setup | Yes | Training Details. For the sampler, we set n = 32 and k = 8, and encode language hints with the text encoder from CLIP ViT-B/32 (Radford et al. 2021). We initiate the LGS from scratch and train the sampling process with a learning rate of 2e-4. The STP module is pre-trained with a learning rate of 1e-6. For the joint tuning of the STP and LLM, we set their learning rates to 2e-5 and 2e-4, respectively, and train for three epochs. The Vicuna-7b LLM (Chiang et al. 2023) is fine-tuned using Low-Rank Adaptation (LoRA) with r = 128 and α = 256. Note that the joint tuning is performed for both VEGAS-generalist in open-ended QA and VEGAS in supervised MCQ, but on different datasets. All training is conducted on 4 A100 40G GPUs with a batch size of 64. All training proceeds for one epoch except for supervised MCQ, which is trained for three epochs. |
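For reproduction purposes, the hyperparameters reported in the Experiment Setup row can be collected into a single config. This is a minimal sketch, not code from the VEGAS repository: the dictionary keys are assumed names, and only the numeric values come from the paper. It also shows the standard LoRA scaling factor α/r implied by the reported r = 128 and α = 256.

```python
# Hedged sketch: the paper's reported training hyperparameters gathered into
# a plain config dict. Key names are illustrative assumptions; the values
# (n, k, learning rates, LoRA r/alpha, batch size) are from the paper.
train_config = {
    "sampler": {"n_frames": 32, "k_selected": 8, "lr": 2e-4},  # LGS, trained from scratch
    "stp_pretrain": {"lr": 1e-6},                              # STP pre-training
    "joint_tuning": {"stp_lr": 2e-5, "llm_lr": 2e-4},          # STP + LLM joint tuning
    "lora": {"r": 128, "alpha": 256},                          # LoRA on Vicuna-7b
    "batch_size": 64,
    "hardware": "4x A100 40G",
}

# In standard LoRA, each low-rank update BA is scaled by alpha / r before
# being added to the frozen weight; these settings give a factor of 2.
lora_scaling = train_config["lora"]["alpha"] / train_config["lora"]["r"]
print(lora_scaling)  # 2.0
```

A config like this makes it easy to check that a reimplementation uses the paper's settings before launching a run; the actual repository at https://github.com/lihao921/VEGAS may organize these values differently.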