Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
Authors: Fei Shen, Hu Ye, Sibo Liu, Jun Zhang, Cong Wang, Xiao Han, Yang Wei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Datasets. We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. The former contains 20,132 training sequences and 2,309 testing sequences, encompassing 7 main characters. Pororo SV includes 10,191 training sequences and 2,208 testing sequences, covering 9 main characters. Metrics. We conduct a comprehensive evaluation of the model, considering both objective and subjective metrics. Objective indicators include classification accuracy of characters (Char-Acc) and F1-score of characters (Char-F1), both extracted using Inception V3. Additionally, we also consider the Fréchet inception distance (FID) (Heusel et al. 2017) score. Quantitative Results. As shown in Table 1, firstly, since LDM (Rombach et al. 2022a) generates each image based solely on individual captions, it performs significantly worse than all other methods on three metrics. Qualitative Results. As some methods have yet to be open-sourced, we qualitatively compared RCDMs with StoryGAN (Li et al. 2019), Story-DALL-E (Maharana, Hannan, and Bansal 2022), AR-LDM (Pan et al. 2024), and Story-LDM (Rahman et al. 2023) on the Flintstones SV and Pororo SV datasets. User Study. The above quantitative and qualitative comparison results demonstrate the substantial advantages of our proposed RCDMs in generating results. However, the task of synthesizing story visualization is often human perception-oriented. Therefore, we also conduct a user study involving 100 volunteers with computer vision backgrounds. Ablation Study. We further devise several variants to demonstrate the efficacy of each module proposed in this study. |
| Researcher Affiliation | Collaboration | Fei Shen1,2*, Hu Ye2, Sibo Liu2, Jun Zhang2, Cong Wang2, Xiao Han2, Yang Wei2 1Nanjing University of Science and Technology 2Tencent AI Lab |
| Pseudocode | No | The paper describes methods using text and mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 3, Eq. 4) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references a link to a third-party model: "1https://huggingface.co/runwayml/stable-diffusion-v1-5". However, there is no explicit statement or link indicating that the authors' *own* code for the RCDMs methodology described in the paper is open-sourced or available. |
| Open Datasets | Yes | We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. |
| Dataset Splits | Yes | We conduct experiments on the Flintstones SV (Maharana and Bansal 2021) and Pororo SV (Li et al. 2019) datasets. The former contains 20,132 training sequences and 2,309 testing sequences, encompassing 7 main characters. Pororo SV includes 10,191 training sequences and 2,208 testing sequences, covering 9 main characters. |
| Hardware Specification | Yes | We perform our experiments on 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | For the frame-contextual 3D diffusion model, we use the pretrained Stable Diffusion V1.5 |
| Experiment Setup | Yes | We employ the AdamW optimizer with a fixed learning rate of 1e-5 in all stages. (3) Following (Pan et al. 2024; Rahman et al. 2023), we train our models using images of size 512 × 512 for the Flintstones SV and Pororo SV datasets. (4) We employ a data augmentation strategy of dropping images in both stages, with the drop count ranging from 0 to 5. We substitute the dropped images with black images. (5) In the inference stage, we use the DDIM (Ho, Jain, and Abbeel 2020) sampler with 20 steps and set the guidance scale w to 2.0 for RCDMs on all stages. |
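The metrics row above cites the Fréchet Inception Distance (Heusel et al. 2017). For reference, a minimal NumPy sketch of the FID formula, assuming feature means and covariances have already been extracted from Inception-V3 activations (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians fitted to
    Inception-V3 features:
        ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
    """
    diff = mu1 - mu2
    # Matrix square root of sigma1 @ sigma2 via eigendecomposition;
    # the product of two PSD covariances has non-negative real eigenvalues.
    w, v = np.linalg.eig(sigma1 @ sigma2)
    sqrt_w = np.sqrt(np.maximum(w.real, 0.0))
    covmean = (v * sqrt_w) @ np.linalg.inv(v)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean).real)
```

In practice the statistics would come from Inception-V3 pool features of real vs. generated frames; the sketch only reproduces the closed-form distance between the two fitted Gaussians.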
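The setup row also describes a conditioning-dropout augmentation: 0 to 5 images per story sequence are replaced with black images. A minimal sketch of how such an augmentation might look (function name, shapes, and RNG handling are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def drop_frames(frames, max_drop=5, rng=None):
    """Randomly replace between 0 and `max_drop` frames of a story
    sequence with black images. `frames` has shape (num_frames, H, W, C).
    Returns a copy; the input array is left untouched."""
    rng = np.random.default_rng() if rng is None else rng
    out = frames.copy()
    n_drop = int(rng.integers(0, max_drop + 1))  # inclusive of max_drop
    idx = rng.choice(len(out), size=min(n_drop, len(out)), replace=False)
    out[idx] = 0  # black image
    return out
```

Dropping conditioning frames during training forces the model to cope with partial context, which matches the paper's stated goal of conditioning on a variable number of known frames.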