Is Your Image a Good Storyteller?

Authors: Xiujie Song, Xiaoyi Pang, Haifeng Tang, Mengyue Wu, Kenny Q. Zhu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on our dataset demonstrate the effectiveness of our approach. Experiments show that the ISA task is challenging for traditional vision models like ViT (Dosovitskiy et al. 2021) and our proposed method significantly outperforms other baseline models on the Semantic Complexity Scoring task. In this section, we present the experimental setup, results, and corresponding analysis. Tables 2, 3, 4, and 5 display performance metrics (RMSE, RMAE, PCC, SRCC) and ablation studies.
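The four reported metrics can be sketched as below. This is a minimal illustration, not the paper's evaluation code; in particular, the paper does not define RMAE, so taking it as the square root of the mean absolute error is an assumption here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def scoring_metrics(y_true, y_pred):
    """Compute RMSE, RMAE, PCC, and SRCC for predicted vs. gold scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # Assumption: RMAE read as the square root of the mean absolute error.
    rmae = float(np.sqrt(np.mean(np.abs(y_true - y_pred))))
    pcc, _ = pearsonr(y_true, y_pred)    # Pearson correlation coefficient
    srcc, _ = spearmanr(y_true, y_pred)  # Spearman rank correlation
    return {"RMSE": rmse, "RMAE": rmae, "PCC": float(pcc), "SRCC": float(srcc)}
```

Lower RMSE/RMAE and higher PCC/SRCC indicate better scoring performance.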
Researcher Affiliation | Collaboration | 1 X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; 2 China Merchants Bank Credit Card Center, Shanghai, China; 3 University of Texas at Arlington, Arlington, Texas, USA
Pseudocode | No | The paper describes the Vision-Language collaborative ISA (VLISA) method and its components (Feature Extractor and Discriminator) in Section 3, along with two versions (Naive VLISA and Chain-of-Thought VLISA). However, it does so using descriptive text and flowcharts, without presenting structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code: https://github.com/xiujiesong/ISA
Open Datasets | Yes | To promote the research on ISA task, we built the first ISA dataset with 2,946 images. Data and code: https://github.com/xiujiesong/ISA
Dataset Splits | Yes | Table 1 shows the distribution of our dataset. We randomly split the data into a training set, a validation set and a test set in a 6:2:2 ratio.
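The stated 6:2:2 random split can be sketched as follows. This is an illustrative reconstruction, not the released code, and the seed value is an arbitrary choice for reproducibility of the example itself.

```python
import random

def split_622(items, seed=42):
    """Shuffle items and split into train/val/test with a 6:2:2 ratio."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Applied to the 2,946 ISA images, this yields roughly 1,767 / 589 / 590 examples.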
Hardware Specification | Yes | Each model is trained and evaluated on either a single NVIDIA A10 GPU or a Tesla V100 GPU.
Software Dependencies | No | We implement the models using PyTorch (Paszke et al. 2019) and Transformers (Wolf et al. 2020). They are fine-tuned based on vit-base-patch16-224, vilt-b32-mlm, bert-base-uncased, and longformer-base-4096, respectively. While the frameworks and specific checkpoints are named, explicit version numbers for the PyTorch and Transformers libraries are not provided.
Experiment Setup | Yes | For ViT, ViLT, BERT, and Longformer, we train them with batch size 16. The maximum text input length of ViLT and BERT is set to 512 tokens. The maximum input length of Longformer is set to 1024 tokens. We repeat all experiments three times and calculate the mean and standard deviation.
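Aggregating repeated runs as mean and standard deviation can be sketched as below. A minimal example, assuming the sample standard deviation is used (the paper does not say which variant it reports).

```python
import statistics

def summarize_runs(scores):
    """Summarize one metric over repeated runs as (mean, std).

    Uses the sample standard deviation; with a single run, std is 0.0.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

With three runs, as in the paper, this produces the "mean ± std" values shown in the results tables.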