Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Authors: Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling. |
| Researcher Affiliation | Industry | Harshit Varma Inception Labs EMAIL Dheeraj Nagaraj Google DeepMind EMAIL Karthikeyan Shanmugam Google DeepMind EMAIL |
| Pseudocode | Yes | Algorithm 1: Training a Glauber Generative Model (GGM) ... Algorithm 2: Inference from a Glauber Generative Model (GGM) |
| Open Source Code | No | The paper does not provide an explicit statement or direct link to its own source code for the methodology described. It references source code for baselines and a third-party FID implementation, but not for GGM itself. |
| Open Datasets | Yes | We train our models on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate on language generation. ... We report the FID (Parmar et al., 2022) values in Table 2 for unconditional 256 × 256 image synthesis on the CelebA-HQ dataset (Karras et al., 2017) and the FFHQ dataset (Karras et al., 2018). ... Footnote 3: CelebA-HQ (CC BY-NC 4.0 License), FFHQ (CC BY-NC-SA 4.0 License). Footnote 4: An open-source replica of the unreleased WebText dataset that was used to train GPT-2: Skylion007/openwebtext (CC0 1.0 Universal License) |
| Dataset Splits | No | The paper mentions training on OpenWebText, CelebA-HQ, and FFHQ, and evaluating on unconditional generations. However, it does not provide specific training/validation/test splits (e.g., percentages or sample counts) for these datasets. For generative models, evaluation is often performed on generated samples rather than a held-out test split of the original data. |
| Hardware Specification | Yes | We train our models on TPUv5e accelerators having a 16 × 16 topology with data-parallelism enabled via Pathways (Barham et al., 2022) and dataloading via PyGrain and TFDS. |
| Software Dependencies | No | Our model and related code is written in JAX (Bradbury et al., 2018) and Flax (Heek et al., 2023). ... We use the AdamW optimizer (Loshchilov & Hutter, 2017). ... We use the T5 tokenizer. ... We use the Huggingface Flax implementations for all the BERT and GPT* architectures used in this paper. ... Our FID implementation is in JAX, following Clean-FID (Parmar et al., 2022). The paper mentions software and libraries like JAX, Flax, AdamW, the T5 tokenizer, Huggingface Flax implementations, and Clean-FID, but it does not specify version numbers for these components, which are crucial for reproducibility. |
| Experiment Setup | Yes | Our model is a 24 layer transformer model based on (Peebles & Xie, 2022; Lou et al., 2023) with 16 attention heads and a hidden size of 1024. ... The number of timesteps T and the sequence length L are fixed to 4096 and 1024 respectively. Πt = Π for all t, and Π(ϕ) = 0.5. We use the AdamW optimizer (Loshchilov & Hutter, 2017) (with β1 = 0.9, β2 = 0.999, ϵ = 10−8) with no weight decay and no dropout, and use EMA with 0.9999 over all training steps during inference. ... We use a batch size of 64 and sample 32 timesteps per example in every iteration, leading to an effective batch size of 2048. Our model has been trained on OpenWebText (Gokaslan & Cohen, 2019) for 2M steps. ... We keep the initial learning rate at 0 and warm it up linearly for 8000 steps to a peak learning rate of 10−4, then decay it to 10−6 using a cosine decay schedule over 2M steps. ... Our model on CelebA-HQ (Karras et al., 2017) has been trained for 1.45M steps and our model on FFHQ (Karras et al., 2018) has been trained for 3M steps. ... During inference, for both CelebA-HQ and FFHQ, we use top-p sampling with p = 0.9. Additionally, for CelebA-HQ we use a temperature of 1.05 for the first half of the denoising steps. |
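The learning-rate schedule quoted above (linear warmup from 0 to a peak of 10−4 over 8000 steps, then cosine decay to 10−6 over the 2M-step run) is unambiguous enough to reconstruct. A minimal pure-Python sketch, assuming the cosine decay spans the steps remaining after warmup (the paper's exact decay window is not stated; function and constant names are illustrative, not from the paper):

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 1e-4
END_LR = 1e-6
WARMUP_STEPS = 8_000
TOTAL_STEPS = 2_000_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to END_LR.

    Assumes the decay phase covers TOTAL_STEPS - WARMUP_STEPS steps,
    which is one plausible reading of "decay ... over 2M steps".
    """
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay: fraction of the decay phase completed so far.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine
```

In the paper's JAX/Flax stack this would typically be expressed via `optax.warmup_cosine_decay_schedule` with matching arguments, but the standalone function above makes the shape of the schedule explicit.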