Glauber Generative Model: Discrete Diffusion Models via Binary Classification
Authors: Harshit Varma, Dheeraj Nagaraj, Karthikeyan Shanmugam
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling. |
| Researcher Affiliation | Industry | Harshit Varma Inception Labs EMAIL Dheeraj Nagaraj Google DeepMind EMAIL Karthikeyan Shanmugam Google DeepMind EMAIL |
| Pseudocode | Yes | Algorithm 1: Training a Glauber Generative Model (GGM) ... Algorithm 2: Inference from a Glauber Generative Model (GGM) |
| Open Source Code | No | The paper does not provide an explicit statement or direct link to its own source code for the methodology described. It references source code for baselines and a third-party FID implementation, but not for GGM itself. |
| Open Datasets | Yes | We train our models on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate on language generation. ... We report the FID (Parmar et al., 2022) values in Table 2 for unconditional 256 × 256 image synthesis on the CelebA-HQ dataset (Karras et al., 2017) and the FFHQ dataset (Karras et al., 2018). ... Footnote 3: CelebA-HQ (CC BY-NC 4.0 License), FFHQ (CC BY-NC-SA 4.0 License). Footnote 4: An open-source replica of the unreleased WebText dataset that was used to train GPT-2: Skylion007/openwebtext (CC0 1.0 Universal License) |
| Dataset Splits | No | The paper mentions training on OpenWebText, CelebA-HQ, and FFHQ, and evaluating on unconditional generations. However, it does not provide specific training/validation/test splits (e.g., percentages or sample counts) for these datasets. For generative models, evaluation is often performed on generated samples rather than a held-out test split of the original data. |
| Hardware Specification | Yes | We train our models on TPUv5e accelerators having a 16 × 16 topology with data-parallelism enabled via Pathways (Barham et al., 2022) and dataloading via PyGrain and TFDS. |
| Software Dependencies | No | Our model and related code is written in JAX (Bradbury et al., 2018) and Flax (Heek et al., 2023). ... We use the AdamW optimizer (Loshchilov & Hutter, 2017). ... We use the T5 tokenizer. ... We use the Huggingface Flax implementations for all the BERT and GPT* architectures used in this paper. ... Our FID implementation is in JAX, following Clean-FID (Parmar et al., 2022). The paper mentions software and libraries like JAX, Flax, AdamW, the T5 tokenizer, Huggingface Flax implementations, and Clean-FID, but it does not specify version numbers for these components, which are crucial for reproducibility. |
| Experiment Setup | Yes | Our model is a 24 layer transformer model based on (Peebles & Xie, 2022; Lou et al., 2023) with 16 attention heads and a hidden size of 1024. ... The number of timesteps T and the sequence length L are fixed to 4096 and 1024 respectively. Πt = Π for all t, and Π(ϕ) = 0.5. We use the AdamW optimizer (Loshchilov & Hutter, 2017) (with β1 = 0.9, β2 = 0.999, ϵ = 10−8) with no weight decay and no dropout, and use EMA with 0.9999 over all training steps during inference. ... We use a batch size of 64 and sample 32 timesteps per example in every iteration, leading to an effective batch size of 2048. Our model has been trained on OpenWebText (Gokaslan & Cohen, 2019) for 2M steps. ... We keep the initial learning rate at 0 and warm it up linearly for 8000 steps to a peak learning rate of 10−4, then decay it to 10−6 using a cosine decay schedule over 2M steps. ... Our model on CelebA-HQ (Karras et al., 2017) has been trained for 1.45M steps and our model on FFHQ (Karras et al., 2018) has been trained for 3M steps. ... During inference, for both CelebA-HQ and FFHQ, we use top-p sampling with p = 0.9. Additionally, for CelebA-HQ we use a temperature of 1.05 for the first half of the denoising steps. |
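The learning-rate schedule quoted above (linear warmup from 0 to a peak of 10−4 over 8000 steps, then cosine decay to 10−6 over the 2M-step run) is unambiguous enough to reconstruct. A minimal pure-Python sketch, assuming the cosine decay spans the steps remaining after warmup (the paper's exact decay window is not stated; function and constant names are illustrative, not from the paper):

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 1e-4
END_LR = 1e-6
WARMUP_STEPS = 8_000
TOTAL_STEPS = 2_000_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to END_LR.

    Assumes the decay phase covers TOTAL_STEPS - WARMUP_STEPS steps,
    which is one plausible reading of "decay ... over 2M steps".
    """
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay: fraction of the decay phase completed so far.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine
```

In the paper's JAX/Flax stack this would typically be expressed via `optax.warmup_cosine_decay_schedule` with matching arguments, but the standalone function above makes the shape of the schedule explicit.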