Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

Authors: Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Zi Wang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Belief parsing and interaction
Open Source Code | Yes | Code and Design Bench can be found at https://github.com/google-deepmind/proactive_t2i_agents
Open Datasets | Yes | We experiment over three image-text datasets: Image In Words (Garg et al., 2024), COCO (Lin et al., 2014), and Design Bench, a benchmark we curated with strong artistic and design elements. Code and Design Bench can be found at https://github.com/google-deepmind/proactive_t2i_agents. Design Bench: https://huggingface.co/datasets/meerahahn/DesignBench
Dataset Splits | Yes | We evaluate over the COCO-Captions dataset validation split (Chen et al., 2015).
Hardware Specification | No | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length. For T2I generation, we use Imagen 3 (Baldridge et al., 2024) across all baselines given its recency and prompt-following capabilities. Both models were served publicly via the Vertex API (https://cloud.google.com/vertex-ai).
Software Dependencies | No | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length. For T2I generation, we use Imagen 3 (Baldridge et al., 2024) across all baselines given its recency and prompt-following capabilities. Both models were served publicly via the Vertex API (https://cloud.google.com/vertex-ai).
Experiment Setup | Yes | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length.
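The belief-parsing-and-interaction procedure cited above (Algorithm 1) can be sketched roughly as follows. This is a hedged, minimal illustration, not the authors' implementation: every name here (`Belief`, `most_uncertain`, `interact`, `answer_fn`) is hypothetical, and a real agent would use an LLM (the paper uses Gemini 1.5) to parse beliefs and phrase questions, plus a T2I model (Imagen 3) to render the final prompt.

```python
# Hypothetical sketch of a belief-driven clarification loop for a
# proactive T2I agent: maintain beliefs over uncertain prompt attributes,
# ask the user about the most uncertain one, then build an enriched prompt.
from dataclasses import dataclass


@dataclass
class Belief:
    """Belief about one attribute: candidate values with probabilities."""
    attribute: str
    candidates: dict  # value -> probability


def uncertainty(belief: Belief) -> float:
    # Simple proxy for uncertainty: 1 minus the top candidate's probability.
    return 1.0 - max(belief.candidates.values())


def most_uncertain(beliefs, threshold=0.3):
    """Pick the attribute worth asking about, if any is uncertain enough."""
    b = max(beliefs, key=uncertainty)
    return b if uncertainty(b) > threshold else None


def interact(prompt, beliefs, answer_fn, max_turns=3):
    """Ask clarifying questions until beliefs are confident, then build a prompt."""
    for _ in range(max_turns):
        target = most_uncertain(beliefs)
        if target is None:
            break
        question = f"For '{prompt}': what {target.attribute} do you want?"
        # Collapse the belief onto the user's answer (a real agent would
        # update the belief graph with an LLM instead).
        target.candidates = {answer_fn(question): 1.0}
    details = ", ".join(max(b.candidates, key=b.candidates.get) for b in beliefs)
    return f"{prompt}, {details}"


beliefs = [
    Belief("color", {"red": 0.5, "blue": 0.5}),
    Belief("background", {"brick wall": 0.9, "field": 0.1}),
]
final_prompt = interact("a bicycle", beliefs, answer_fn=lambda q: "red")
# final_prompt == "a bicycle, red, brick wall"
```

The loop asks only about the ambiguous color (uncertainty 0.5) and skips the already-confident background, mirroring the idea of eliciting crucial information with few questions.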
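The evaluation quoted in the Research Type row relies on VQAScore (Lin et al., 2024), which rates image-text alignment as the probability that a VQA model answers "Yes" to a question templated from the prompt. The sketch below illustrates that evaluation shape only; `vqa_yes_prob` is a toy word-overlap stand-in, not the real VQA model.

```python
# Hedged sketch of a VQAScore-style alignment comparison. The real metric
# feeds the image and a question like "Does this figure show '{prompt}'?"
# to a generative VQA model and reads off P("Yes"); here a toy stub
# scores overlap between prompt words and an image description.

def vqa_yes_prob(image_description: str, text: str) -> float:
    """Toy stand-in for P('Yes' | image, templated question)."""
    words = text.lower().split()
    hits = sum(w in image_description.lower().split() for w in words)
    return hits / max(len(words), 1)


def vqascore(image_description: str, prompt: str) -> float:
    # In the actual metric the image itself, not a description, is scored.
    return vqa_yes_prob(image_description, prompt)


# An image matching an agent-elicited prompt should outscore a generic one.
elicited = vqascore("a red vintage bicycle against a brick wall",
                    "red bicycle against a brick wall")
generic = vqascore("a blue car on a road",
                   "red bicycle against a brick wall")
assert elicited > generic
```

Comparing scores this way is how the reported "at least 2 times higher VQAScore" claim would be measured: score the agent's final image and the standard T2I baseline's image against the same ground-truth prompt.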