ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
Authors: Sixiao Zheng, Yanwei Fu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Pororo SV and Flintstones SV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation. |
| Researcher Affiliation | Academia | 1 Fudan University, 2 Shanghai Innovation Institute |
| Pseudocode | No | The paper describes methods in text and uses architectural diagrams (Figure 2, Figure 4) but does not present structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We employ two popular benchmark datasets, Pororo SV (Li et al. 2019) and Flintstones SV (Gupta et al. 2018), to evaluate the performance of our model on story visualization and story continuation tasks. |
| Dataset Splits | Yes | Pororo SV contains 10,191, 2,334, and 2,208 stories within the train, validation, and test splits, respectively, featuring 9 main characters. Flintstones SV contains 20,132, 2,071, and 2,309 stories within the train, validation, and test splits, respectively, featuring 7 main characters and 323 backgrounds. |
| Hardware Specification | Yes | Training is performed on 4 NVIDIA A800 GPUs with a batch size of 12, a learning rate of 5×10⁻⁵ and 40,000 iterations for Pororo SV and 80,000 iterations for Flintstones SV. ... The experiment is conducted on an A800 GPU with 50 DDIM steps to ensure a fair comparison. |
| Software Dependencies | Yes | We initialize ContextualStory with the pre-trained Stable Diffusion 2.1-base and fine-tune only the UNet parameters with the AdamW optimizer. |
| Experiment Setup | Yes | Training is performed on 4 NVIDIA A800 GPUs with a batch size of 12, a learning rate of 5×10⁻⁵ and 40,000 iterations for Pororo SV and 80,000 iterations for Flintstones SV. The SETA window size is k = 3, and the SC layer count is 4. During training, we apply classifier-free guidance by randomly dropping input storylines with a 0.1 probability and use the PYoCo mixed noise prior for noise initialization. For inference, we use the DDIM sampler with 50 steps and a guidance scale of 7.5 to generate 256×256 images. |
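The training and inference settings quoted in the last row can be sketched as a minimal, framework-agnostic snippet. Only the 0.1 storyline drop probability, the 7.5 guidance scale, and the use of a shared-plus-independent (PYoCo-style) mixed noise prior come from the paper; the function names, the empty-string null conditioning, and the α = 1 noise parameterization are illustrative assumptions, not the authors' implementation.

```python
import math
import random

GUIDANCE_SCALE = 7.5  # DDIM inference guidance scale (from the paper)
DROP_PROB = 0.1       # storyline drop probability during training (from the paper)


def maybe_drop_storyline(storyline, null_token="", rng=random):
    """Classifier-free guidance training: with probability DROP_PROB,
    replace the storyline with null conditioning so the model also
    learns the unconditional distribution."""
    return null_token if rng.random() < DROP_PROB else storyline


def cfg_combine(eps_uncond, eps_cond, scale=GUIDANCE_SCALE):
    """Inference-time CFG: extrapolate from the unconditional noise
    prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def mixed_noise(num_frames, dim, alpha=1.0, rng=random):
    """PYoCo-style mixed noise prior (hypothetical parameterization):
    one noise vector shared across all frames plus an independent
    per-frame vector, rescaled so each frame's noise is unit-variance."""
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(1.0 + alpha ** 2)
    return [
        [(alpha * s + rng.gauss(0.0, 1.0)) / norm for s in shared]
        for _ in range(num_frames)
    ]
```

The shared component correlates the initial noise across the story's frames, which is the stated purpose of the mixed noise prior in video and story diffusion models.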