Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Authors: Cristina Nader Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Yeqing Li, Shixin Luo, Yasumasa Onoe, Zarana Parekh, Ivana Kajic, Mandy Guo, Wenlei Zhou, Sarah Rosston, Roopal Garg, Hongliang Fei, Jordi Pont-Tuset, Su Wang, Henna Nandwani, Andrew Bunner, Kevin Swersky, David J. Fleet, Oliver Wang, Jason Michael Baldridge
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024×1024 images without cascades, is preferred by human evaluators over SDXL (44.0% vs. 21.4%). The paper also includes extensive empirical evaluations on a publicly available dataset, namely, Conceptual 12M (or CC12M) (Changpinyo et al., 2021). The evaluation uses metrics such as FID, FD-Dino, CMMD, CLIPscore, and DSG, along with human evaluation studies. |
| Researcher Affiliation | Industry | Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang (Google) |
| Pseudocode | Yes | The greedy growing algorithm can be described as follows. Phase 1: In this phase, the core components of the chosen architecture are identified (see subsection 3.1), and a Shallow-UViT model is built on top of them. The Shallow-UViT is trained on the entire training collection of text-image pairs, as it is not limited to high-resolution training images. Phase 2: The second phase greedily grows the Shallow-UViT's encoder/decoder (namely, throwing away the lower-resolution blocks and adding higher-resolution blocks) to obtain the final model. More specifically, this phase adds encoder and decoder layers at different resolutions, while preserving the core representation layers at the spatial resolution used during the first phase. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | To avoid this issue, we first demonstrate our key findings through extensive empirical evaluations on a publicly available dataset, namely, Conceptual 12M (or CC12M) (Changpinyo et al., 2021). Image distribution metrics and Clip-Score are obtained using 30k prompts from the MSCOCO-captions validation set (Chen et al., 2015), while the semantic metrics are extracted on the 1k prompts from DSG-1k (Cho et al., 2024). We consider a simple counting task, defined here as the task of generating images of up to 5 objects based on a subset of text prompts from the numerical split of the Gecko benchmark (Wiles et al., 2024). |
| Dataset Splits | Yes | To ablate our hypothesis that greedy growing helps the model learn strong representations with larger, diverse corpora, we also train the full model on a high-resolution subset of data used to train the Shallow-UViT; i.e., we simply removed all samples with resolution lower than the target model resolution. To that end, beyond greedy growing, we explore three training baselines: 1) We create a baseline with all layers trained from scratch on this subset; 2) As an alternative to the frozen phase in the greedy growing, we fine-tune the core components on this smaller high-resolution subset jointly with the grown components (randomly initialized); and 3) A third baseline adds the optional phase of unfreezing the core components after warming up the random weights for 500k steps. Models are trained for 2M steps in total. We train Shallow-UViT on the entire CC12M training set, while corresponding end-to-end models were trained with CC12M's subset of 8.7M images whose dimensions are equal or larger than 512 pixels. Shallow-UViT models were trained on 64×64 images, by resizing the smallest dimension of the images to 64 and random cropping along the remaining dimension as needed. The end-to-end models are trained at a target resolution of 512×512. Image distribution metrics and Clip-Score are obtained using 30k prompts from the MSCOCO-captions validation set (Chen et al., 2015), while the semantic metrics are extracted on the 1k prompts from DSG-1k (Cho et al., 2024). |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the implementation or experimentation. |
| Experiment Setup | Yes | To stress-test the stability and convergence of our greedy growing algorithm, we restrict the batch size to 256 instead of the standard 2k, and we use no other explicit form of regularization. Models are trained for 2M steps in total. Our Shallow-UViT results were obtained with guidance weights fixed at 1.75, and their corresponding UViT models with guidance 4.0. Vermeer is an 8B parameter model grown from 256 to 1024 pixel resolution. The baseline version (Vermeer raw model) is trained with 2k batch size at 256 resolution for 2M iterations, then grown to 1k resolution and fine-tuned for an additional 1M steps. We then fine-tune for 8K steps with a mixture of the original data and the aesthetic subset. |
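The two-phase recipe quoted in the Pseudocode row above can be sketched in toy form. This is a minimal illustration, not the paper's implementation: the function names (`build_shallow_uvit`, `grow`) and the dictionary-based stand-ins for UNet/UViT blocks are invented for clarity, and real training steps are omitted.

```python
# Hypothetical sketch of the paper's two-phase greedy growing recipe.
# Toy dictionaries stand in for real network blocks; no actual training
# happens here. All names are illustrative assumptions.

def build_shallow_uvit(core_depth, base_res=64):
    """Phase 1: a Shallow-UViT keeps only the deep core blocks at a
    fixed spatial resolution, with no down/up-sampling encoder/decoder,
    so it can be trained on the full (mostly low-res) dataset."""
    return {
        "core": [{"res": base_res, "frozen": False} for _ in range(core_depth)],
        "encoder": [],
        "decoder": [],
    }

def grow(model, target_res, base_res=64):
    """Phase 2: freeze the pre-trained core (preserving its
    representation) and greedily add encoder/decoder blocks at each
    successively higher resolution up to the target."""
    for blk in model["core"]:
        blk["frozen"] = True  # core stays at its Phase-1 resolution
    res = base_res * 2
    while res <= target_res:
        model["encoder"].append({"res": res, "frozen": False})
        model["decoder"].append({"res": res, "frozen": False})
        res *= 2
    return model

model = build_shallow_uvit(core_depth=4)   # Phase 1: trained at 64x64
model = grow(model, target_res=512)        # Phase 2: end-to-end 512x512
print([b["res"] for b in model["encoder"]])  # -> [128, 256, 512]
```

In the paper's actual setup, the grown model is then trained end-to-end on the high-resolution subset while the frozen core (optionally unfrozen later, per the ablations in the Dataset Splits row) anchors the representation learned in Phase 1.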