Storybooth: Training-Free Multi-Subject Consistency for Improved Visual Storytelling

Authors: Jaskirat Singh, Junshen K Chen, Jonas Kohler, Michael Cohen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both qualitative and quantitative results we find that the proposed approach surpasses prior state-of-the-art, exhibiting improved consistency across both multiple-characters and fine-grain subject details. ... Experimental analysis reveals (Sec. 5) that the proposed approach allows for improved character consistency and text-to-image alignment while exhibiting 30× faster inference time than optimization-based methods (Ruiz et al., 2022).
Researcher Affiliation | Collaboration | Jaskirat Singh1,2, Junshen K. Chen1, Jonas Kohler1, Michael Cohen1 — 1Meta Gen AI, 2Australian National University
Pseudocode | No | The paper describes the method using equations and textual descriptions of layers and processes, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository.
Open Datasets | Yes | For consistency, the storyboard prompt dataset from Tewel et al. (2024) is used for evaluating single-subject generation. We also construct an analogous multi-subject dataset (refer appendix) placing two randomly selected subjects in different settings.
Dataset Splits | No | The paper mentions using a 'storyboard prompt dataset from Tewel et al. (2024)' and constructing a 'multi-subject dataset' but does not specify any training, validation, or test splits for these datasets.
Hardware Specification | Yes | For a fair comparison, all methods are benchmarked on a single Nvidia-H100 GPU, using the same base model as (Zhou et al., 2024) for generation.
Software Dependencies | No | The paper mentions various models and methods like 'Textual-inversion', 'DB-LoRA', 'IP-Adapter', 'BLIP-Diffusion', 'Storygen', 'Consistory', and 'Storydiffusion', but does not provide specific version numbers for software libraries or dependencies used in their implementation.
Experiment Setup | Yes | Our key insight here is to introduce an additional dropout term (see Eq. 3) which randomly allows each token to also pay attention to other global level tokens (e.g., for background) with a small dropout-probability ϑd. ... Since early parts of the reverse diffusion process are primarily responsible for positional or layout consolidation, we use the above insight to increase pose-variance by using a negative ϖ = −0.5 during the initial timesteps t ∈ [1000, 950]. A positive ϖ = 0.4 is then used for t ∈ [950, 600] in order to improve visual consistency.
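The quoted setup can be sketched in code. The following is a minimal, hypothetical illustration (not the authors' implementation): `dropout_attention_mask` mimics the dropout term that lets each token occasionally attend to global tokens with small probability ϑd, and `consistency_weight` encodes the timestep schedule for ϖ quoted above. The function names, the ϑd default, and the exact interval-boundary handling are assumptions.

```python
import numpy as np

def consistency_weight(t, w_early=-0.5, w_mid=0.4):
    """Hypothetical schedule for the consistency weight ϖ over diffusion timesteps.

    Per the quoted setup: a negative weight for t in [1000, 950] to increase
    pose variance, a positive weight for t in [950, 600] to improve visual
    consistency; later timesteps are assumed to use no extra weighting.
    """
    if t > 950:
        return w_early
    if t > 600:
        return w_mid
    return 0.0

def dropout_attention_mask(subject_mask, theta_d=0.1, rng=None):
    """Build a boolean (query x key) attention mask.

    Baseline: every query may attend to keys inside the subject region
    (subject_mask). With small dropout probability theta_d, a query is also
    allowed to attend to a global (non-subject) key, mimicking the paper's
    dropout term. theta_d = 0.1 is an assumed default, not from the paper.
    """
    rng = np.random.default_rng(rng)
    n = subject_mask.shape[0]
    # baseline: attention restricted to subject tokens (same columns for all queries)
    allow = np.broadcast_to(subject_mask[None, :], (n, n)).copy()
    # randomly open attention to global tokens with probability theta_d
    dropout_open = rng.random((n, n)) < theta_d
    allow |= dropout_open & ~subject_mask[None, :]
    return allow
```

With `theta_d=0.0` the mask reduces to pure subject-restricted attention; increasing `theta_d` progressively re-admits background tokens, which is the mechanism the quote credits for preserving global context.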