Storybooth: Training-Free Multi-Subject Consistency for Improved Visual Storytelling
Authors: Jaskirat Singh, Junshen K Chen, Jonas Kohler, Michael Cohen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both qualitative and quantitative results we find that the proposed approach surpasses prior state-of-the-art, exhibiting improved consistency across both multiple-characters and fine-grain subject details. ... Experimental analysis reveals (Sec. 5) that the proposed approach allows for improved character consistency and text-to-image alignment while exhibiting 30× faster inference time than optimization-based methods (Ruiz et al., 2022). |
| Researcher Affiliation | Collaboration | Jaskirat Singh (1,2), Junshen K. Chen (1), Jonas Kohler (1), Michael Cohen (1) — (1) Meta Gen AI, (2) Australian National University |
| Pseudocode | No | The paper describes the method using equations and textual descriptions of layers and processes, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | For consistency, the storyboard prompt dataset from Tewel et al. (2024) is used for evaluating single-subject generation. We also construct an analogous multi-subject dataset (refer appendix) placing two randomly selected subjects in different settings. |
| Dataset Splits | No | The paper mentions using a 'storyboard prompt dataset from Tewel et al. (2024)' and constructing a 'multi-subject dataset' but does not specify any training, validation, or test splits for these datasets. |
| Hardware Specification | Yes | For a fair comparison, all methods are benchmarked on a single Nvidia-H100 GPU, using the same base model as (Zhou et al., 2024) for generation. |
| Software Dependencies | No | The paper mentions various models and methods like 'Textual-inversion', 'DB-LoRA', 'IP-Adapter', 'BLIP-Diffusion', 'Storygen', 'Consistory', and 'Storydiffusion', but does not provide specific version numbers for software libraries or dependencies used in their implementation. |
| Experiment Setup | Yes | Our key insight here is to introduce an additional dropout term (see Eq. 3) which randomly allows each token to also pay attention to other global level tokens (e.g., for background) with a small dropout-probability ϑd. ... Since early parts of the reverse diffusion process are primarily responsible for positional or layout consolidation, we use the above insight to increase pose-variance by using a negative ϖ = −0.5 during the initial timesteps t ∈ [1000, 950]. A positive ϖ = 0.4 is then used for t ∈ [950, 600] in order to improve visual consistency. |
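The quoted setup combines two mechanisms: a subject-restricted attention mask with a small dropout probability that occasionally lets subject tokens attend to global (background) tokens, and a timestep-dependent weight ϖ that is negative early in the reverse diffusion process and positive mid-process. A minimal sketch of both is below; the function names, the use of subject id 0 to mark global tokens, and the NumPy masking are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_attention_mask(subject_ids, p_dropout=0.1, rng=rng):
    """Hypothetical sketch of the dropout-augmented attention mask (Eq. 3).

    Each token attends to tokens with the same subject id; with small
    probability `p_dropout`, a token is additionally allowed to attend to
    global-level tokens (marked here by subject id 0).
    """
    ids = np.asarray(subject_ids)
    n = len(ids)
    # Base mask: token i may attend to token j only within the same subject.
    mask = ids[:, None] == ids[None, :]
    # Dropout term: selected tokens also attend to global tokens (id 0).
    drop = rng.random(n) < p_dropout
    mask |= drop[:, None] & (ids[None, :] == 0)
    return mask

def guidance_weight(t):
    """Timestep schedule for ϖ quoted above: negative during the initial
    layout-forming steps (t in (950, 1000]) to increase pose variance,
    positive for t in (600, 950] to improve visual consistency, else zero."""
    if 950 < t <= 1000:
        return -0.5
    if 600 < t <= 950:
        return 0.4
    return 0.0
```

With `p_dropout=1.0`, every subject token gains attention to global tokens while cross-subject attention stays blocked, which is the behavior the dropout term is meant to produce.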