Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
Authors: Fei Shen, Hu Ye, Sibo Liu, Jun Zhang, Cong Wang, Xiao Han, Yang Wei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Datasets. We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. The former contains 20,132 training sequences and 2,309 testing sequences, encompassing 7 main characters. Pororo SV includes 10,191 training sequences and 2,208 testing sequences, covering 9 main characters. Metrics. We conduct a comprehensive evaluation of the model, considering both objective and subjective metrics. Objective indicators include classification accuracy of characters (Char-Acc) and F1-score of characters (Char-F1), both extracted using Inception V3. Additionally, we also consider the Fréchet inception distance (FID) (Heusel et al. 2017) score. Quantitative Results. As shown in Table 1, firstly, since LDM (Rombach et al. 2022a) generates each image based solely on individual captions, it performs significantly worse than all other methods on three metrics. Qualitative Results. As some methods have yet to be open-sourced, we qualitatively compared RCDMs with StoryGAN (Li et al. 2019), Story-DALL-E (Maharana, Hannan, and Bansal 2022), AR-LDM (Pan et al. 2024), and Story-LDM (Rahman et al. 2023) on the Flintstones SV and Pororo SV datasets. User Study. The above quantitative and qualitative comparison results demonstrate the substantial advantages of our proposed RCDMs in generating results. However, the task of synthesizing story visualization is often human perception-oriented. Therefore, we also conduct a user study involving 100 volunteers with computer vision backgrounds. Ablation Study. We further devise several variants to demonstrate the efficacy of each module proposed in this study. |
| Researcher Affiliation | Collaboration | Fei Shen1,2*, Hu Ye2, Sibo Liu2, Jun Zhang2, Cong Wang2, Xiao Han2, Yang Wei2 1Nanjing University of Science and Technology 2Tencent AI Lab |
| Pseudocode | No | The paper describes methods using text and mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 3, Eq. 4) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references a link to a third-party model: "1https://huggingface.co/runwayml/stable-diffusion-v1-5". However, there is no explicit statement or link indicating that the authors' *own* code for the RCDMs methodology described in the paper is open-sourced or available. |
| Open Datasets | Yes | We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. |
| Dataset Splits | Yes | We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. The former contains 20,132 training sequences and 2,309 testing sequences, encompassing 7 main characters. Pororo SV includes 10,191 training sequences and 2,208 testing sequences, covering 9 main characters. |
| Hardware Specification | Yes | We perform our experiments on 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | For the frame-contextual 3D diffusion model, we use the pretrained Stable Diffusion V1.5 |
| Experiment Setup | Yes | We employ the AdamW optimizer with a fixed learning rate of 1e-5 in all stages. (3) Following (Pan et al. 2024; Rahman et al. 2023), we train our models using images of size 512 × 512 for the Flintstones SV and Pororo SV datasets. (4) We employ a data augmentation strategy of dropping images in both stages, with the drop count ranging from 0 to 5. We substitute the dropped images with black images. (5) In the inference stage, we use the DDIM (Ho, Jain, and Abbeel 2020) sampler with 20 steps and set the guidance scale w to 2.0 for RCDMs on all stages. |
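The metrics row above cites the Fréchet Inception Distance (Heusel et al. 2017). For reference, a minimal NumPy sketch of the FID formula, assuming feature means and covariances have already been extracted from Inception-V3 activations (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians fitted to
    Inception-V3 features:
        ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
    """
    diff = mu1 - mu2
    # Matrix square root of sigma1 @ sigma2 via eigendecomposition;
    # the product of two PSD covariances has non-negative real eigenvalues.
    w, v = np.linalg.eig(sigma1 @ sigma2)
    sqrt_w = np.sqrt(np.maximum(w.real, 0.0))
    covmean = (v * sqrt_w) @ np.linalg.inv(v)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean).real)
```

In practice the statistics would come from Inception-V3 pool features of real vs. generated frames; the sketch only reproduces the closed-form distance between the two fitted Gaussians.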
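The setup row also describes a conditioning-dropout augmentation: 0 to 5 images per story sequence are replaced with black images. A minimal sketch of how such an augmentation might look (function name, shapes, and RNG handling are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def drop_frames(frames, max_drop=5, rng=None):
    """Randomly replace between 0 and `max_drop` frames of a story
    sequence with black images. `frames` has shape (num_frames, H, W, C).
    Returns a copy; the input array is left untouched."""
    rng = np.random.default_rng() if rng is None else rng
    out = frames.copy()
    n_drop = int(rng.integers(0, max_drop + 1))  # inclusive of max_drop
    idx = rng.choice(len(out), size=min(n_drop, len(out)), replace=False)
    out[idx] = 0  # black image
    return out
```

Dropping conditioning frames during training forces the model to cope with partial context, which matches the paper's stated goal of conditioning on a variable number of known frames.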