Decoder Denoising Pretraining for Semantic Segmentation

Authors: Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen, Niki Parmar, Matthias Minderer, Mohammad Norouzi

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.
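The denoising objective summarized above (corrupt an image with Gaussian noise, then train the decoder to predict the corruption) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_decoder`, the noise level `sigma`, and the choice of NumPy are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, sigma=0.2):
    """Corrupt an image with Gaussian noise; the denoising target is the noise itself.
    sigma is a hypothetical noise scale chosen for illustration."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    return image + noise, noise

def toy_decoder(noisy_image):
    """Placeholder for the segmentation decoder being pretrained.
    The real model conditions on encoder features; here we just return zeros."""
    return np.zeros_like(noisy_image)

# Pretraining runs at 224x224 resolution per the paper.
image = rng.uniform(0.0, 1.0, size=(224, 224, 3))
noisy, noise = add_noise(image)
pred = toy_decoder(noisy)
loss = np.mean((pred - noise) ** 2)  # MSE denoising loss on the predicted noise
```

In the paper's setup, gradients from this loss update the decoder while the encoder is initialized from supervised pretraining; the toy version only shows the shape of the objective.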
Researcher Affiliation | Industry | Emmanuel Asiedu Brempong (Google Research), Simon Kornblith (Google Research), Ting Chen (Google Research), Niki Parmar (Google Research), Matthias Minderer (Google Research), Mohammad Norouzi (Google Research)
Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations, but it does not include any distinct pseudocode blocks or algorithms.
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is being released, nor does it provide a link to a code repository. The 'Reviewed on OpenReview' link is for the review process, not code.
Open Datasets | Yes | The encoder is pre-trained on ImageNet-21k (Deng et al., 2009) classification... After pretraining, the model is fine-tuned on the Cityscapes, Pascal Context, or ADE20K semantic segmentation datasets (Cordts et al., 2016; Mottaghi et al., 2014; Zhou et al., 2018).
Dataset Splits | Yes | Right: Mean IoU on the Pascal Context dataset as a function of the fraction of labeled training images available. Decoder denoising pretraining is particularly effective when a small number of labeled images is available, but continues to outperform supervised pretraining even on the full dataset. For the 100% setting, we report the means of 10 runs on all of the datasets. On Pascal Context and ADE20K, we also report the mean of 10 runs (with different subsets) for the 1%, 5% and 10% label fractions and 5 runs for the 20% setting. On Cityscapes, we report the mean of 10 runs for the 1/30 setting, 6 runs for the 1/8 setting and 4 runs for the 1/4 setting.
Hardware Specification | Yes | Indeed, training DDeP costs 117.6 PFLOPs compared to 48.3 PFLOPs for the supervised baseline on 32 TPU-v4 chips.
Software Dependencies | No | The paper mentions using the Adam optimizer, but does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used for implementation.
Experiment Setup | Yes | For downstream fine-tuning of the pretrained models for the semantic segmentation task, we use the standard pixel-wise cross-entropy loss. We use the Adam (Kingma & Ba, 2015) optimizer with a cosine learning rate decay schedule. For Decoder Denoising Pretraining (DDeP), we use a batch size of 512 and train for 100 epochs. The learning rate is 6e-5 for the 1× and 3× width decoders, and 1e-4 for the 2× width decoder. When fine-tuning the pretrained models on the target semantic segmentation task, we sweep over weight decay and learning rate values between [1e-5, 3e-4] and choose the best combination for each task. During training, random cropping and random left-right flipping are applied to the images and their corresponding segmentation masks. We randomly crop the images to a fixed size of 1024×1024 for Cityscapes and 512×512 for ADE20K and Pascal Context. All of the decoder denoising pretraining runs are conducted at a 224×224 resolution.
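The cosine learning rate decay quoted above can be written as a small schedule function. This is a sketch of a standard cosine decay under assumptions the quote does not pin down (no warmup, decay to zero); the base rate 6e-5 matches the value reported for the 1× and 3× width decoders.

```python
import math

def cosine_lr(step, total_steps, base_lr=6e-5):
    """Cosine learning-rate decay from base_lr down to 0 over total_steps.
    Assumes no warmup and a final rate of 0, which the paper does not specify."""
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate starts at `base_lr`, reaches half of it at the schedule midpoint, and decays to zero at the final step.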