Preference Adaptive and Sequential Text-to-Image Generation
Authors: Ofir Nabati, Guy Tennenholtz, Chihwei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, Craig Boutilier
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach for PASTA involves a multi-stage data collection and training process. We first collect multi-turn interaction data from human raters with a baseline LMM. Using this sequential data, as well as large-scale, open-source (single-turn) preference data, we train a user simulator. In particular, we employ an EM strategy to train user preference and choice models, which capture implicit user preference types in the data. We then construct a new large-scale dataset, which consists of interactions between a simulated user and the LMM. Finally, we leverage this augmented data, encompassing both human and simulated interactions, to train PASTA, our value-based RL agent, which presents a sequence of diverse slates of images to a user. PASTA interacts with the user and sequentially refines its generated images to better suit their underlying preferences. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. |
| Researcher Affiliation | Industry | 1Google Research 2Google DeepMind. Correspondence to: Ofir Nabati <EMAIL>, Guy Tennenholtz <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Mini-Batch Expectation-Maximization User Model Optimization |
| Open Source Code | No | We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems. (Only the data is released; no code release is mentioned.) Link to the dataset: https://www.kaggle.com/datasets/googleai/pasta-data. |
| Open Datasets | Yes | We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems. Both human and simulation data are open-sourced to support research on multi-turn T2I generation. Link to the dataset: https://www.kaggle.com/datasets/googleai/pasta-data. |
| Dataset Splits | No | First, we assess prediction accuracy on the Pick-a-Pic test set (Kirstain et al., 2023) and ranking using Spearman's rank correlation (Spearman, 1961) on the HPS dataset (Wu et al., 2023). Second, we evaluate prompt choice prediction accuracy and cross-turn preference accuracy on our human-rated data. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU or CPU models. It only mentions the models used (Stable Diffusion XL, Gemini 1.5 Flash, Gemma 2B) but not the underlying hardware. |
| Software Dependencies | No | The paper mentions optimizers like AdamW and Adafactor, and models like Stable Diffusion XL, Gemini 1.5 Flash, and Gemma 2B, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) that would be needed for replication. |
| Experiment Setup | Yes | E.4. Hyperparameters. Table 1 (main training): learning rate: cosine annealing scheduler (lr=3e-4, T=10e3) (Loshchilov & Hutter, 2016); training steps: 50e3; batch size: 2048; update target network phase: 256; optimizer: AdamW (Loshchilov & Hutter, 2019), weight decay = 1e-4; κ1 = 1; α_prior = 0.999. Table 2 (fine-tuning): learning rate: cosine annealing scheduler (lr=3e-7, T=10e3); training steps: 50e3; batch size: 8; gradient norm clipping: 0.5; update target network phase: 256; optimizer: AdamW, weight decay = 1e-2; κ2 = 0.01; κ3 = 1; κ4 = 0.1; α_prior = 0.999; τ_max = 3. Table 3 (PASTA): learning rate: 1e-5; training steps: 1e4; batch size: 128; optimizer: Adafactor (Shazeer & Stern, 2018), weight decay = 1e-2; gradient norm clipping: 1; expectile parameter α = 0.7; ℓ_q = 651; ℓ_v = 651; L = 4; M = 4; L_C = 25; number of categories: 5; H = 5; N^w_max = 62. |
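The paper's Algorithm 1 ("Mini-Batch Expectation-Maximization User Model Optimization") is only named in the table above, not reproduced. As a rough illustration of the technique, here is a minimal mini-batch EM sketch for fitting latent user preference types under an assumed mixture-of-logistic-choice model; all names, sizes, and the model form are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 3, 8                          # assumed: number of latent user types, feature dim
W = rng.normal(size=(K, D)) * 0.1    # per-type utility weights
log_pi = np.log(np.ones(K) / K)      # log prior over latent user types

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def em_minibatch_step(W, log_pi, feats, choices, lr=0.1):
    """One mini-batch EM update on pairwise choice data.

    feats:   (B, D) feature difference between the two candidate images
    choices: (B,)   1 if the first image was chosen, else 0
    """
    # E-step: posterior responsibility of each latent type for each interaction
    logits = feats @ W.T                              # (B, K) per-type utilities
    p = sigmoid(logits)
    ll = np.where(choices[:, None] == 1,
                  np.log(p + 1e-9), np.log(1 - p + 1e-9))
    log_post = log_pi[None, :] + ll
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # responsibilities (B, K)

    # M-step: responsibility-weighted gradient ascent on the choice likelihood,
    # plus a closed-form update of the type prior
    err = choices[:, None] - p                        # Bernoulli gradient (B, K)
    grad = (post * err).T @ feats / len(feats)        # (K, D)
    W_new = W + lr * grad
    pi_new = post.mean(axis=0)
    return W_new, np.log(pi_new + 1e-9)
```

Repeating `em_minibatch_step` over shuffled mini-batches yields a stochastic EM loop; the posterior `post` is what makes the implicit user preference types recoverable from choice data alone.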
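Table 3 lists an "expectile parameter α = 0.7" for the value-based RL agent. That hyperparameter is characteristic of expectile regression for value learning (as in implicit Q-learning); a minimal sketch of the asymmetric loss, offered as an assumption about its role rather than the paper's confirmed objective:

```python
import numpy as np

def expectile_loss(diff, alpha=0.7):
    """Asymmetric squared loss on diff = target_q - v(s).

    With alpha > 0.5, positive errors are weighted more heavily, so the
    fitted v(s) is pushed toward an upper expectile of the Q-value
    distribution instead of its mean.
    """
    weight = np.where(diff > 0, alpha, 1.0 - alpha)
    return float(np.mean(weight * diff ** 2))
```

For example, `expectile_loss(np.array([2.0, -1.0]), alpha=0.7)` weights the errors as `0.7 * 4 + 0.3 * 1` and averages to `1.55`, whereas `alpha = 0.5` recovers a symmetric (halved) mean-squared error.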