Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
Authors: Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment, with at least 2 times higher VQAScore (Lin et al., 2024) than standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Zi Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Belief parsing and interaction |
| Open Source Code | Yes | Code and Design Bench can be found at https://github.com/google-deepmind/proactive_t2i_agents. |
| Open Datasets | Yes | We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014), and Design Bench, a benchmark we curated with strong artistic and design elements. Code and Design Bench can be found at https://github.com/google-deepmind/proactive_t2i_agents. Design Bench: https://huggingface.co/datasets/meerahahn/DesignBench |
| Dataset Splits | Yes | We evaluate over the COCO-Captions dataset validation split (Chen et al., 2015). |
| Hardware Specification | No | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length. For T2I generation, we use Imagen 3 (Baldridge et al., 2024) across all baselines given its recency and prompt-following capabilities. Both models were served publicly via the Vertex API (https://cloud.google.com/vertex-ai). |
| Software Dependencies | No | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length. For T2I generation, we use Imagen 3 (Baldridge et al., 2024) across all baselines given its recency and prompt-following capabilities. Both models were served publicly via the Vertex API (https://cloud.google.com/vertex-ai). |
| Experiment Setup | Yes | We implement the agent belief parsing and interaction in Algorithm 1 on top of Gemini 1.5 (Gemini Team Google, 2024) using the default temperature and a 32K context length. |
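The table notes that the paper provides pseudocode ("Algorithm 1 Belief parsing and interaction"), in which the agent parses a prompt into a belief over uncertain attributes and asks clarifying questions before generating. A minimal sketch of such a loop, with stubbed LLM calls (the function names, the fixed entity list, and the belief representation here are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """Toy belief state: entity -> {attribute: value or None if uncertain}."""
    entities: dict = field(default_factory=dict)

def parse_belief(prompt: str) -> Belief:
    """Stub for LLM-based belief parsing: mark mentioned entities as having
    an unknown 'color' attribute. A real agent would use an LLM here."""
    belief = Belief()
    for entity in ("cat", "hat"):  # illustrative fixed vocabulary
        if entity in prompt:
            belief.entities[entity] = {"color": None}
    return belief

def most_uncertain(belief: Belief):
    """Return the first (entity, attribute) still missing a value, else None."""
    for entity, attrs in belief.entities.items():
        for attr, value in attrs.items():
            if value is None:
                return entity, attr
    return None

def interact(prompt: str, answer_fn, max_questions: int = 3) -> Belief:
    """Ask clarifying questions about uncertain attributes, updating the
    belief with each answer, until resolved or the question budget runs out."""
    belief = parse_belief(prompt)
    for _ in range(max_questions):
        target = most_uncertain(belief)
        if target is None:
            break
        entity, attr = target
        answer = answer_fn(f"What {attr} is the {entity}?")
        belief.entities[entity][attr] = answer
    return belief

belief = interact("a cat wearing a hat", answer_fn=lambda question: "red")
print(belief.entities)  # {'cat': {'color': 'red'}, 'hat': {'color': 'red'}}
```

The resolved belief would then condition the final T2I prompt; the paper's actual agent builds a richer belief graph via Gemini 1.5 rather than this toy attribute table.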
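The headline result is an at-least-2x improvement in VQAScore (Lin et al., 2024), which scores image-text alignment as a VQA model's probability of answering "Yes" to whether the image shows the text. A schematic sketch of that metric with a stubbed VQA model (the stub's probabilities and substring matching are invented purely for illustration):

```python
def vqa_yes_probability(image: dict, question: str) -> float:
    """Stub standing in for a real VQA model's P('Yes') output.
    Invented toy behavior: images whose content appears in the
    question score high, others score low."""
    return 0.9 if image.get("content") in question else 0.2

def vqascore(image: dict, text: str) -> float:
    """VQAScore-style metric: probability of 'Yes' to a
    does-this-figure-show-the-text question."""
    question = f'Does this figure show "{text}"? Answer Yes or No.'
    return vqa_yes_probability(image, question)

aligned = {"content": "a red cat wearing a hat"}
misaligned = {"content": "an empty street"}
print(vqascore(aligned, "a red cat wearing a hat"))     # 0.9
print(vqascore(misaligned, "a red cat wearing a hat"))  # 0.2
```

In the paper's evaluation, the VQA model is a real learned model scoring generated images against the user's intended prompt; the 2x claim compares agent-elicited generations against standard single-shot T2I generation.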