ReDistill: Residual Encoded Distillation for Peak Memory Reduction of CNNs

Authors: Fang Chen, Gourav Datta, Mujahid Al Rafi, Hyeran Jeon, Meng Tang

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Experiments demonstrate our method's superior performance compared to other feature-based and response-based distillation methods when applied to the same student network. The code is available at https://github.com/mengtanglab/ReDistill.
Researcher Affiliation: Academia. Fang Chen (University of California Merced), Gourav Datta (Case Western Reserve University), Mujahid Al Rafi (University of California Merced), Hyeran Jeon (University of California Merced), Meng Tang (University of California Merced).
Pseudocode: No. The paper includes diagrams (Figure 2, Figure 3) and textual descriptions of the method, but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. The code is available at https://github.com/mengtanglab/ReDistill.
Open Datasets: Yes. 1) STL10 (Coates et al., 2011) contains 5K training images with 10 classes and 8K testing images of resolution 96×96 pixels. 2) ImageNet (Russakovsky et al., 2015) is a widely used classification dataset, providing 1.2 million training images and 50K validation images over 1,000 classes. 1) CIFAR-10 (Krizhevsky et al., 2009) comprises 60,000 color images of 32×32 resolution across 10 classes, with each class containing 6,000 images. 2) CelebA (Liu et al., 2015) is a large-scale face attributes dataset containing over 200,000 celebrity images, each annotated with 40 attributes.
Dataset Splits: Yes. 1) STL10 (Coates et al., 2011) contains 5K training images with 10 classes and 8K testing images of resolution 96×96 pixels. 2) ImageNet (Russakovsky et al., 2015) is a widely used classification dataset, providing 1.2 million training images and 50K validation images over 1,000 classes. 1) CIFAR-10 (Krizhevsky et al., 2009) comprises 60,000 color images of 32×32 resolution across 10 classes, with each class containing 6,000 images. The dataset is divided into 50,000 training images and 10,000 test images.
Hardware Specification: Yes. All experiments are implemented in PyTorch and evaluated on 4 NVIDIA A100 GPUs. All experiments are implemented in PyTorch and evaluated on an NVIDIA 4090 GPU. In addition to the theoretical peak memory, we also measure the actual peak memory consumed on an NVIDIA Jetson TX2 device.
Software Dependencies: No. The paper mentions PyTorch as the implementation framework and the TorchAO toolkit, but no specific version numbers are provided for these software dependencies, which is required for reproducibility.
Experiment Setup: Yes. On the STL10 dataset... trained from scratch... for 300 epochs. The batch size is set to 8 and the dropout rate to 0.2. SGD with momentum 0.9 is used as the optimizer. The initial learning rate is 0.01, reduced by a factor of 0.2 at the 180th, 240th, and 270th epochs, respectively. The α in Equation 9 is set to 50. On the ImageNet dataset... training runs for 300 epochs with the learning rate decayed at the 180th, 240th, and 270th epochs by a factor of 0.1... The α in Equation 9 is set to 1. For the teacher model, we keep the same experiment settings as DDPM (Ho et al., 2020), applying T = 1000, β1 = 10^-4, βT = 0.02... For the CIFAR-10 dataset, we train all models for 1000K iterations... For the CelebA dataset, we train all models for 250K iterations...
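The STL10 schedule quoted above (initial learning rate 0.01, decayed by a factor of 0.2 at epochs 180, 240, and 270) is a standard step-decay policy. A minimal sketch of that policy, assuming a pure-Python helper (the function name and signature are illustrative, not from the paper; the authors' code presumably uses PyTorch's built-in scheduler):

```python
def lr_at_epoch(epoch, base_lr=0.01, milestones=(180, 240, 270), gamma=0.2):
    """Step-decay schedule: multiply base_lr by gamma once per milestone passed.

    Matches the quoted STL10 setup with the default arguments; the ImageNet
    run described above would use gamma=0.1 with the same milestones.
    """
    decays = sum(epoch >= m for m in milestones)  # milestones already reached
    return base_lr * gamma ** decays

# Learning rate over the quoted 300-epoch STL10 run:
for e in (0, 179, 180, 240, 299):
    print(e, lr_at_epoch(e))
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[180, 240, 270], gamma=0.2)` wrapped around an `SGD` optimizer with `momentum=0.9`.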