Deconstructing Denoising Diffusion Models for Self-Supervised Learning
Authors: Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive process allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning. |
| Researcher Affiliation | Collaboration | Xinlei Chen¹, Zhuang Liu¹·², Saining Xie³, Kaiming He¹·⁴ — ¹FAIR, Meta; ²Princeton University; ³New York University; ⁴MIT |
| Pseudocode | No | The paper describes methods and processes using mathematical equations and textual descriptions but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release, nor does it provide a direct link to a source-code repository for the methodology described. |
| Open Datasets | Yes | We train DiT-L for 400 epochs on ImageNet (Deng et al., 2009). Finally, we transfer the representations learned with l-DAE for object detection and segmentation. We use the ViTDet framework (Li et al., 2022) and evaluate on the COCO benchmark (Lin et al., 2014) against MAE and supervised pre-training. |
| Dataset Splits | Yes | Our linear probing implementation follows the practice of MAE (He et al., 2022). We use clean, 256×256-sized images for linear probing training and evaluation. The ViT output feature map is globally pooled by average pooling. It is then processed by a parameter-free BatchNorm (Ioffe & Szegedy, 2015) layer and a linear classifier layer, following He et al. (2022). The training batch size is 16384, learning rate is 6.4×10⁻³ (cosine decay schedule), weight decay is 0, and training length is 90 epochs. Randomly resized crop and flipping are used during training and a single center crop is used for testing. Top-1 accuracy is reported. |
| Hardware Specification | Yes | On a 256-core TPU-v3 pod, training DiT-L takes 12 hours. |
| Software Dependencies | No | The paper mentions using existing frameworks and practices (e.g., the DiT architecture, MAE's linear probing protocol) but does not provide specific version numbers for software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | We train DiT-L for 400 epochs on ImageNet (Deng et al., 2009). The original DiTs (Peebles & Xie, 2023) are trained with a batch size of 256. To speed up our exploration, we increase the batch size to 2048. We perform linear learning rate warmup (Goyal et al., 2017) for 100 epochs and then decay it following a half-cycle cosine schedule. We use a base learning rate blr = 1e-4 (Peebles & Xie, 2023) by default, and set the actual lr following the linear scaling rule (Goyal et al., 2017): blr × batch size / 256. No weight decay is used (Peebles & Xie, 2023). We train for 400 epochs by default. ... Linear probing. ... training batch size is 16384, learning rate is 6.4×10⁻³ (cosine decay schedule), weight decay is 0, and training length is 90 epochs. ... End-to-end fine-tuning. ... training batch size is 1024, initial learning rate is 4×10⁻³, weight decay is 0.05, drop path (Huang et al., 2016) is 0.1, and training length is 100 epochs. We use a layerwise learning rate decay of 0.85 (B) or 0.65 (L). MixUp (Zhang et al., 2018a) (0.8), CutMix (Yun et al., 2019) (1.0), RandAug (Cubuk et al., 2020) (9, 0.5), and exponential moving average (0.9999) are used, similar to He et al. (2022). |
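The linear-probing protocol quoted in the Dataset Splits row (global average pooling of the ViT feature map, then a parameter-free BatchNorm before the linear classifier) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function name is ours, and the "parameter-free BatchNorm" is approximated here with per-batch statistics rather than running statistics.

```python
import numpy as np

def linear_probe_features(feature_map: np.ndarray) -> np.ndarray:
    """Prepare ViT features for linear probing, MAE-style (sketch).

    feature_map: (batch, tokens, dim) output of the ViT backbone.
    Returns (batch, dim) features: globally average-pooled, then
    normalized by a parameter-free BatchNorm (no learnable affine).
    """
    pooled = feature_map.mean(axis=1)              # global average pooling over tokens
    mu = pooled.mean(axis=0, keepdims=True)        # per-dimension batch mean
    sigma = pooled.std(axis=0, keepdims=True) + 1e-6
    return (pooled - mu) / sigma                   # normalized, no scale/shift params

rng = np.random.default_rng(0)
feats = linear_probe_features(rng.normal(size=(8, 196, 1024)))
# A linear classifier (dim -> num_classes) would then be trained on `feats`.
```

In the actual protocol the BatchNorm layer maintains running statistics during the 90-epoch probe training; the per-batch normalization above only illustrates the parameter-free aspect.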
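The pre-training schedule in the Experiment Setup row combines three quoted ingredients: the linear scaling rule (lr = blr × batch size / 256), 100 epochs of linear warmup, and a half-cycle cosine decay over the remaining epochs. A minimal sketch of that schedule, using the paper's stated defaults (blr = 1e-4, batch size 2048, 400 epochs); the function name is illustrative:

```python
import math

def lr_at_epoch(epoch: float, total_epochs: int = 400, warmup_epochs: int = 100,
                base_lr: float = 1e-4, batch_size: int = 2048,
                ref_batch: int = 256) -> float:
    """Learning rate under linear warmup + half-cycle cosine decay (sketch)."""
    # Linear scaling rule: actual peak lr = blr * batch_size / 256
    peak = base_lr * batch_size / ref_batch
    if epoch < warmup_epochs:
        # Linear warmup from 0 to the peak over the first 100 epochs
        return peak * epoch / warmup_epochs
    # Half-cycle cosine decay from the peak down to 0
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * 0.5 * (1.0 + math.cos(math.pi * t))
```

With these defaults the peak rate is 1e-4 × 2048 / 256 = 8e-4, reached at the end of warmup (epoch 100) and decayed to 0 by epoch 400.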