Semantic Self-adaptation: Enhancing Generalization with a Single Sample

Authors: Sherwin Bahmani, Oliver Hahn, Eduard Zamfir, Nikita Araslanov, Daniel Cremers, Stefan Roth

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data. Our code and pre-trained models are available at https://github.com/visinf/self-adaptive.
Researcher Affiliation | Academia | Sherwin Bahmani (1), Oliver Hahn (1), Eduard Zamfir (2), Nikita Araslanov (3), Daniel Cremers (3), Stefan Roth (1,4) — (1) TU Darmstadt, (2) University of Würzburg, (3) TU Munich, (4) hessian.AI
Pseudocode | Yes | Algorithm 1: Summary of self-adaptation.
1. Train segmentation model on source data (best-practice, established methodology).
2. Replace BatchNorm with SaN (cf. Sec. 3.1).
3. Tune hyperparameter α in SaN on the validation set (WildDash).
4. # Inference on any dataset. Initial model parameters: θ0.
5. foreach test sample do
6.     Obtain θ by minimizing cross-entropy w.r.t. pseudo-labels in Eq. (4).
7.     Predict segmentation for the test sample using θ.
8.     Reset model parameters to θ0.
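The inference loop of Algorithm 1 (steps 4–8) can be sketched in a framework-agnostic way. This is a hedged illustration only: a toy scalar parameter stands in for the segmentation network, and `adapt` minimizes a stand-in quadratic loss rather than the paper's cross-entropy on pseudo-labels (Eq. 4); `self_adaptive_inference` is a hypothetical name.

```python
def adapt(theta0, sample, lr=0.1, steps=5):
    """Tune parameters on a single test sample (step 6 of Algorithm 1).

    Toy stand-in: gradient descent on (theta - sample)^2 instead of the
    cross-entropy w.r.t. pseudo-labels used in the paper.
    """
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * (theta - sample)  # d/dθ of the stand-in loss
        theta -= lr * grad
    return theta


def self_adaptive_inference(theta0, test_samples):
    """Episodic adaptation: every sample starts from the same θ0."""
    predictions = []
    for x in test_samples:
        theta = adapt(theta0, x)       # step 6: per-sample tuning
        predictions.append(theta)      # step 7: predict with adapted θ
        # step 8: reset is implicit, since adapt() always starts from θ0
    return predictions


preds = self_adaptive_inference(0.0, [1.0, -2.0, 3.0])
```

The key property this makes explicit is that adaptation is episodic: no state leaks from one test sample to the next, so the prediction for each sample is independent of the order in which samples arrive.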
Open Source Code | Yes | Our code and pre-trained models are available at https://github.com/visinf/self-adaptive.
Open Datasets | Yes | Source data. We train our model on the training split of two synthetic datasets (mutually exclusive) with low-cost ground-truth annotation: GTA (Richter et al., 2016) and SYNTHIA (Ros et al., 2016).
Validation set. For model selection and hyperparameter tuning, we use the validation set of WildDash (Zendel et al., 2018).
Multi-target evaluation. Following model selection, we evaluate the single model on three target domains comprising the validation sets from Cityscapes (Cordts et al., 2016), BDD (Yu et al., 2020), and IDD (Varma et al., 2019). To compare to previous works, we also evaluate on Mapillary (Neuhold et al., 2017). We use ResNet-50 and ResNet-101 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) as backbone. The ACDC dataset (Sakaridis et al., 2021) offers densely labeled driving scenes under adverse weather conditions such as fog, rain, and snow.
Dataset Splits | Yes | GTA. GTA (Richter et al., 2016) is a street-view dataset generated semi-automatically from the computer game Grand Theft Auto V. The dataset consists of 12,403 training images, 6,382 validation images, and 6,181 testing images of resolution 1914×1052 with 19 different semantic classes.
SYNTHIA. We use the SYNTHIA-RAND-CITYSCAPES subset of the synthetic dataset SYNTHIA (Ros et al., 2016), which contains 9,400 images and has 16 semantic classes in common with GTA.
Cityscapes. Cityscapes (Cordts et al., 2016) is an ego-centric street-scene dataset and contains 5,000 high-resolution images with 2048×1024 pixels. It is split into 2,975 train, 500 val, and 1,525 test images with 19 annotated semantic classes.
BDD. BDD (Yu et al., 2020) is a driving-video dataset, which also contains semantic labelings with the identical 19 classes as the other datasets. Images have a resolution of 1280×720 pixels. The training, validation, and test sets contain 7,000, 1,000, and 2,000 images, respectively.
IDD. IDD (Varma et al., 2019) is a dataset for road-scene understanding in unstructured environments. It contains 10,003 images annotated with 34 classes, though we only evaluate on the 19 classes overlapping with the other datasets. IDD is split into 6,993 training images, 981 validation images, and 2,029 test images.
Mapillary. Annotations from Mapillary (Neuhold et al., 2017) contain 66 object classes; analogously to IDD, we only evaluate on the 19 classes overlapping with the other datasets. The dataset is split into a training set with 18,000 images and a validation set with 2,000 images with a minimum resolution of 1920×1080 pixels.
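For quick cross-checking, the split sizes quoted above can be collected in a small lookup table. This is an illustrative sketch only (the `SPLITS` name and `total_images` helper are hypothetical; the numbers are copied from the text, with `None` for splits the excerpt does not state):

```python
# Images per split, as reported in the excerpt above; None = not stated.
SPLITS = {
    "GTA":        {"train": 12403, "val": 6382, "test": 6181},
    "SYNTHIA":    {"train": 9400,  "val": None, "test": None},
    "Cityscapes": {"train": 2975,  "val": 500,  "test": 1525},
    "BDD":        {"train": 7000,  "val": 1000, "test": 2000},
    "IDD":        {"train": 6993,  "val": 981,  "test": 2029},
    "Mapillary":  {"train": 18000, "val": 2000, "test": None},
}


def total_images(name):
    """Sum the reported split sizes for one dataset, skipping missing splits."""
    return sum(n for n in SPLITS[name].values() if n is not None)
```

A useful sanity check: the IDD splits (6,993 + 981 + 2,029) sum exactly to the 10,003 total images the text reports, and the Cityscapes splits sum to its stated 5,000.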
Hardware Specification | Yes | We train our models with SyncBN (Paszke et al., 2019) on two NVIDIA GeForce RTX 2080 GPUs. We obtain these results by running inference on a single NVIDIA GeForce RTX 2080 GPU.
Software Dependencies | No | We implement our framework in PyTorch (Paszke et al., 2019). Our code and pre-trained models are publicly available. We also discuss trivial-to-implement but crucially useful training details of our baseline in depth here and in Appendix A. We use the Pillow library (https://pillow.readthedocs.io) to implement photometric augmentation.
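The excerpt says photometric augmentation is implemented with Pillow but gives no code. As a dependency-free sketch of what such an augmentation does, the usual jitter recipe (multiplicative brightness, contrast scaling around the mean) can be written out on grayscale values in [0, 1]; the function name and factor ranges here are illustrative assumptions, not the authors' implementation:

```python
import random


def photometric_jitter(pixels, rng, brightness=0.4, contrast=0.4):
    """Toy photometric jitter on grayscale pixel values in [0, 1].

    Draws a brightness factor and a contrast factor uniformly around 1,
    scales pixels by the brightness factor, then stretches them around
    the image mean by the contrast factor, clamping to [0, 1].
    """
    b = 1.0 + rng.uniform(-brightness, brightness)  # brightness factor
    c = 1.0 + rng.uniform(-contrast, contrast)      # contrast factor
    mean = sum(pixels) / len(pixels)
    return [min(1.0, max(0.0, (p * b - mean) * c + mean)) for p in pixels]


jittered = photometric_jitter([0.1, 0.5, 0.9], random.Random(0))
```

Because both factors stay positive, the transform is monotone: darker pixels remain no brighter than lighter ones, which is the property that makes photometric jitter label-preserving for segmentation.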
Experiment Setup | Yes | We minimize the cross-entropy loss with an SGD optimizer and a learning rate of 0.005, decayed polynomially with the power set to 0.9. All models are trained on the source domains for 50 epochs with batch size, momentum, and weight decay set to 4, 0.9, and 0.0001, respectively. For data augmentation, we compute crops of random size (0.08 to 1.0 of the original image size), apply a random aspect ratio (3/4 to 4/3) to the crop, and resize the result to 512×512 pixels. We also use random horizontal flipping, color jitter, random blur, and grayscaling.
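Two pieces of this setup are easy to pin down concretely: the polynomial learning-rate decay and the random-size crop sampling. The sketch below is hedged: the exact decay form and the log-uniform aspect-ratio draw are assumptions matching common "poly" schedules and RandomResizedCrop-style implementations, not details confirmed by the excerpt.

```python
import math
import random


def poly_lr(base_lr, step, total_steps, power=0.9):
    """Assumed 'poly' schedule: base_lr * (1 - step/total)^power."""
    return base_lr * (1.0 - step / total_steps) ** power


def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Sample a crop box: area is a random fraction of the image area in
    `scale`; aspect ratio is drawn log-uniformly from `ratio` (assumption).
    Returns (top, left, crop_height, crop_width); the caller then resizes
    the crop to 512x512.
    """
    area = height * width
    for _ in range(10):  # rejection-sample until the box fits in the image
        target_area = rng.uniform(*scale) * area
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return rng.randint(0, height - h), rng.randint(0, width - w), h, w
    return 0, 0, height, width  # fallback: use the whole image
```

With base_lr = 0.005 and power = 0.9, the rate starts at 0.005 and decays smoothly to 0 at the final step; the crop sampler at GTA resolution (1914×1052) always returns a box that lies inside the image.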