Is Your Image a Good Storyteller?

Authors: Xiujie Song, Xiaoyi Pang, Haifeng Tang, Mengyue Wu, Kenny Q. Zhu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on our dataset demonstrate the effectiveness of our approach. Experiments show that the ISA task is challenging for traditional vision models like ViT (Dosovitskiy et al. 2021) and our proposed method significantly outperforms other baseline models on the Semantic Complexity Scoring task. In this section, we present the experimental setup, results, and corresponding analysis. Tables 2, 3, 4, and 5 display performance metrics (RMSE, RMAE, PCC, SRCC) and ablation studies.
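The four reported metrics can be sketched as below. This is a minimal illustration, not the paper's evaluation code; in particular, the paper does not define RMAE, so taking it as the square root of the mean absolute error is an assumption here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def scoring_metrics(y_true, y_pred):
    """Compute RMSE, RMAE, PCC, and SRCC for predicted vs. gold scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # Assumption: RMAE read as the square root of the mean absolute error.
    rmae = float(np.sqrt(np.mean(np.abs(y_true - y_pred))))
    pcc, _ = pearsonr(y_true, y_pred)    # Pearson correlation coefficient
    srcc, _ = spearmanr(y_true, y_pred)  # Spearman rank correlation
    return {"RMSE": rmse, "RMAE": rmae, "PCC": float(pcc), "SRCC": float(srcc)}
```

Lower RMSE/RMAE and higher PCC/SRCC indicate better scoring performance.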
Researcher Affiliation | Collaboration | 1 X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; 2 China Merchants Bank Credit Card Center, Shanghai, China; 3 University of Texas at Arlington, Arlington, Texas, USA
Pseudocode | No | The paper describes the Vision-Language collaborative ISA (VLISA) method and its components (Feature Extractor and Discriminator) in Section 3, along with two versions (Naive VLISA and Chain-of-Thought VLISA). However, it does so using descriptive text and flowcharts, without presenting structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code: https://github.com/xiujiesong/ISA
Open Datasets | Yes | To promote the research on ISA task, we built the first ISA dataset with 2,946 images. Data and code: https://github.com/xiujiesong/ISA
Dataset Splits | Yes | Table 1 shows the distribution of our dataset. We randomly split the data into a training set, a validation set and a test set in a 6:2:2 ratio.
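The stated 6:2:2 random split can be sketched as follows. This is an illustrative reconstruction, not the released code, and the seed value is an arbitrary choice for reproducibility of the example itself.

```python
import random

def split_622(items, seed=42):
    """Shuffle items and split into train/val/test with a 6:2:2 ratio."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Applied to the 2,946 ISA images, this yields roughly 1,767 / 589 / 590 examples.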
Hardware Specification | Yes | Each model is trained and evaluated on either a single NVIDIA A10 GPU or a Tesla V100 GPU.
Software Dependencies | No | We implement the models using PyTorch (Paszke et al. 2019) and Transformers (Wolf et al. 2020). They are fine-tuned based on vit-base-patch16-224, vilt-b32-mlm, bert-base-uncased, and longformer-base-4096, respectively. While the frameworks and specific checkpoints are named, explicit version numbers for the PyTorch and Transformers libraries are not provided.
Experiment Setup | Yes | For ViT, ViLT, BERT, and Longformer, we train them with batch size 16. The maximum text input length of ViLT and BERT is set to 512 tokens. The maximum input length of Longformer is set to 1024 tokens. We repeat all experiments three times and calculate the mean and standard deviation.
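Aggregating repeated runs as mean and standard deviation can be sketched as below. A minimal example, assuming the sample standard deviation is used (the paper does not say which variant it reports).

```python
import statistics

def summarize_runs(scores):
    """Summarize one metric over repeated runs as (mean, std).

    Uses the sample standard deviation; with a single run, std is 0.0.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

With three runs, as in the paper, this produces the "mean ± std" values shown in the results tables.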