Deconstructing Denoising Diffusion Models for Self-Supervised Learning
Authors: Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive process allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning. |
| Researcher Affiliation | Collaboration | Xinlei Chen¹, Zhuang Liu¹·², Saining Xie³, Kaiming He¹·⁴ — ¹FAIR, Meta; ²Princeton University; ³New York University; ⁴MIT |
| Pseudocode | No | The paper describes methods and processes using mathematical equations and textual descriptions but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release, nor does it provide a direct link to a source-code repository for the methodology described. |
| Open Datasets | Yes | We train DiT-L for 400 epochs on ImageNet (Deng et al., 2009). Finally, we transfer the representations learned with l-DAE for object detection and segmentation. We use the ViTDet framework (Li et al., 2022) and evaluate on the COCO benchmark (Lin et al., 2014) against MAE and supervised pre-training. |
| Dataset Splits | Yes | Our linear probing implementation follows the practice of MAE (He et al., 2022). We use clean, 256×256-sized images for linear probing training and evaluation. The ViT output feature map is globally pooled by average pooling. It is then processed by a parameter-free BatchNorm (Ioffe & Szegedy, 2015) layer and a linear classifier layer, following He et al. (2022). The training batch size is 16384, learning rate is 6.4×10⁻³ (cosine decay schedule), weight decay is 0, and training length is 90 epochs. Randomly resized crop and flipping are used during training and a single center crop is used for testing. Top-1 accuracy is reported. |
| Hardware Specification | Yes | On a 256-core TPU-v3 pod, training DiT-L takes 12 hours. |
| Software Dependencies | No | The paper mentions using existing frameworks and practices (e.g., the DiT architecture, MAE's linear probing protocol) but does not provide specific version numbers for software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | We train DiT-L for 400 epochs on ImageNet (Deng et al., 2009). The original DiTs (Peebles & Xie, 2023) are trained with a batch size of 256. To speed up our exploration, we increase the batch size to 2048. We perform linear learning rate warmup (Goyal et al., 2017) for 100 epochs and then decay it following a half-cycle cosine schedule. We use a base learning rate blr = 1e-4 (Peebles & Xie, 2023) by default, and set the actual lr following the linear scaling rule (Goyal et al., 2017): blr × batch size / 256. No weight decay is used (Peebles & Xie, 2023). We train for 400 epochs by default. ... Linear probing. ... training batch size is 16384, learning rate is 6.4×10⁻³ (cosine decay schedule), weight decay is 0, and training length is 90 epochs. ... End-to-end fine-tuning. ... training batch size is 1024, initial learning rate is 4×10⁻³, weight decay is 0.05, drop path (Huang et al., 2016) is 0.1, and training length is 100 epochs. We use a layerwise learning rate decay of 0.85 (B) or 0.65 (L). MixUp (Zhang et al., 2018a) (0.8), CutMix (Yun et al., 2019) (1.0), RandAug (Cubuk et al., 2020) (9, 0.5), and exponential moving average (0.9999) are used, similar to He et al. (2022). |
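The linear-probing protocol quoted in the Dataset Splits row (global average pooling of the ViT feature map, then a parameter-free BatchNorm before the linear classifier) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function name is ours, and the "parameter-free BatchNorm" is approximated here with per-batch statistics rather than running statistics.

```python
import numpy as np

def linear_probe_features(feature_map: np.ndarray) -> np.ndarray:
    """Prepare ViT features for linear probing, MAE-style (sketch).

    feature_map: (batch, tokens, dim) output of the ViT backbone.
    Returns (batch, dim) features: globally average-pooled, then
    normalized by a parameter-free BatchNorm (no learnable affine).
    """
    pooled = feature_map.mean(axis=1)              # global average pooling over tokens
    mu = pooled.mean(axis=0, keepdims=True)        # per-dimension batch mean
    sigma = pooled.std(axis=0, keepdims=True) + 1e-6
    return (pooled - mu) / sigma                   # normalized, no scale/shift params

rng = np.random.default_rng(0)
feats = linear_probe_features(rng.normal(size=(8, 196, 1024)))
# A linear classifier (dim -> num_classes) would then be trained on `feats`.
```

In the actual protocol the BatchNorm layer maintains running statistics during the 90-epoch probe training; the per-batch normalization above only illustrates the parameter-free aspect.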
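The pre-training schedule in the Experiment Setup row combines three quoted ingredients: the linear scaling rule (lr = blr × batch size / 256), 100 epochs of linear warmup, and a half-cycle cosine decay over the remaining epochs. A minimal sketch of that schedule, using the paper's stated defaults (blr = 1e-4, batch size 2048, 400 epochs); the function name is illustrative:

```python
import math

def lr_at_epoch(epoch: float, total_epochs: int = 400, warmup_epochs: int = 100,
                base_lr: float = 1e-4, batch_size: int = 2048,
                ref_batch: int = 256) -> float:
    """Learning rate under linear warmup + half-cycle cosine decay (sketch)."""
    # Linear scaling rule: actual peak lr = blr * batch_size / 256
    peak = base_lr * batch_size / ref_batch
    if epoch < warmup_epochs:
        # Linear warmup from 0 to the peak over the first 100 epochs
        return peak * epoch / warmup_epochs
    # Half-cycle cosine decay from the peak down to 0
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * 0.5 * (1.0 + math.cos(math.pi * t))
```

With these defaults the peak rate is 1e-4 × 2048 / 256 = 8e-4, reached at the end of warmup (epoch 100) and decayed to 0 by epoch 400.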