Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Authors: Cristina Nader Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Yeqing Li, Shixin Luo, Yasumasa Onoe, Zarana Parekh, Ivana Kajic, Mandy Guo, Wenlei Zhou, Sarah Rosston, Roopal Garg, Hongliang Fei, Jordi Pont-Tuset, Su Wang, Henna Nandwani, Andrew Bunner, Kevin Swersky, David J. Fleet, Oliver Wang, Jason Michael Baldridge
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024×1024 images without cascades, is preferred by human evaluators over SDXL (44.0% vs. 21.4%). The paper also includes extensive empirical evaluations on a publicly available dataset, namely, Conceptual 12M (or CC12M) (Changpinyo et al., 2021). The evaluation uses metrics such as FID, FD-Dino, CMMD, CLIPscore, and DSG, along with human evaluation studies. |
| Researcher Affiliation | Industry | Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang (Google) |
| Pseudocode | Yes | The greedy growing algorithm can be described as follows. Phase 1: In this phase, the core components of the chosen architecture are identified (see subsection 3.1), and a Shallow-UViT model is built on top of them. The Shallow-UViT is trained on the entire training collection of text-image pairs, as it is not limited to high-resolution training images. Phase 2: The second phase greedily grows the Shallow-UViT's encoder/decoder (namely, throwing away the lower-resolution blocks and adding higher-resolution blocks) to obtain the final model. More specifically, this phase adds encoder and decoder layers at different resolutions, while preserving the core representation layers at the spatial resolution used during the first phase. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | To avoid this issue, we first demonstrate our key findings through extensive empirical evaluations on a publicly available dataset, namely, Conceptual 12M (or CC12M) (Changpinyo et al., 2021). Image distribution metrics and Clip-Score are obtained using 30k prompts from the MSCOCO-captions validation set (Chen et al., 2015), while the semantic metrics are extracted on the 1k prompts from DSG-1k (Cho et al., 2024). We consider a simple counting task, defined here as the task of generating images of up to 5 objects based on a subset of text prompts from the numerical split of the Gecko benchmark (Wiles et al., 2024). |
| Dataset Splits | Yes | To ablate our hypothesis that greedy growing helps the model learn strong representations with larger, diverse corpora, we also train the full model on a high-resolution subset of data used to train the Shallow-UViT; i.e., we simply removed all samples with resolution lower than the target model resolution. To that end, beyond greedy growing, we explore three training baselines: 1) We create a baseline with all layers trained from scratch on this subset; 2) As an alternative to the frozen phase in the greedy growing, we fine-tune the core components on this smaller high-resolution subset jointly with the grown components (randomly initialized); and 3) A third baseline adds the optional phase of unfreezing the core components after warming up the random weights for 500k steps. Models are trained for 2M steps in total. We train Shallow-UViT on the entire CC12M training set, while corresponding end-to-end models were trained with CC12M's subset of 8.7M images whose dimensions are equal or larger than 512 pixels. Shallow-UViT models were trained on 64×64 images, by resizing the smallest dimension of the images to 64 and random cropping along the remaining dimension as needed. The end-to-end models are trained at a target resolution of 512×512. Image distribution metrics and Clip-Score are obtained using 30k prompts from the MSCOCO-captions validation set (Chen et al., 2015), while the semantic metrics are extracted on the 1k prompts from DSG-1k (Cho et al., 2024). |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the implementation or experimentation. |
| Experiment Setup | Yes | To stress-test the stability and convergence of our greedy growing algorithm, we restrict the batch size to 256 instead of the standard 2k, and we use no other explicit form of regularization. Models are trained for 2M steps in total. Our Shallow-UViT results were obtained with guidance weights fixed at 1.75, and their corresponding UViT models with guidance 4.0. Vermeer is an 8B parameter model grown from 256 to 1024 pixel resolution. The baseline version (Vermeer raw model) is trained with 2k batch size at 256 resolution for 2M iterations, then grown to 1k resolution and fine-tuned for an additional 1M steps. We then fine-tune for 8K steps with a mixture of the original data and the aesthetic subset. |
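The two-phase recipe quoted in the Pseudocode row above can be sketched in toy form. This is a minimal illustration, not the paper's implementation: the function names (`build_shallow_uvit`, `grow`) and the dictionary-based stand-ins for UNet/UViT blocks are invented for clarity, and real training steps are omitted.

```python
# Hypothetical sketch of the paper's two-phase greedy growing recipe.
# Toy dictionaries stand in for real network blocks; no actual training
# happens here. All names are illustrative assumptions.

def build_shallow_uvit(core_depth, base_res=64):
    """Phase 1: a Shallow-UViT keeps only the deep core blocks at a
    fixed spatial resolution, with no down/up-sampling encoder/decoder,
    so it can be trained on the full (mostly low-res) dataset."""
    return {
        "core": [{"res": base_res, "frozen": False} for _ in range(core_depth)],
        "encoder": [],
        "decoder": [],
    }

def grow(model, target_res, base_res=64):
    """Phase 2: freeze the pre-trained core (preserving its
    representation) and greedily add encoder/decoder blocks at each
    successively higher resolution up to the target."""
    for blk in model["core"]:
        blk["frozen"] = True  # core stays at its Phase-1 resolution
    res = base_res * 2
    while res <= target_res:
        model["encoder"].append({"res": res, "frozen": False})
        model["decoder"].append({"res": res, "frozen": False})
        res *= 2
    return model

model = build_shallow_uvit(core_depth=4)   # Phase 1: trained at 64x64
model = grow(model, target_res=512)        # Phase 2: end-to-end 512x512
print([b["res"] for b in model["encoder"]])  # -> [128, 256, 512]
```

In the paper's actual setup, the grown model is then trained end-to-end on the high-resolution subset while the frozen core (optionally unfrozen later, per the ablations in the Dataset Splits row) anchors the representation learned in Phase 1.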