Ensemble Distribution Distillation via Flow Matching

Authors: Jonggeon Park, Giung Nam, Hyunsu Kim, Jongmin Yoon, Juho Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments on image classification and language tasks demonstrate the effectiveness of the proposed method compared to existing ensemble distillation approaches, validating both its efficiency (Section 5.4) and effectiveness (Sections 5.5 and 5.6).
Researcher Affiliation | Academia | Jonggeon Park*1, Giung Nam*1, Hyunsu Kim1, Jongmin Yoon1, Juho Lee1. *Equal contribution. 1Korea Advanced Institute of Science and Technology, Republic of Korea. Correspondence to: Juho Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training EDFM. Algorithm 2: Sampling from EDFM.
Open Source Code | No | The paper does not provide an explicit statement or a direct link to the source code for the methodology described in this paper.
Open Datasets | Yes | CIFAR-10/100 (Krizhevsky and Hinton, 2009): ... The datasets are publicly available at http://www.cs.toronto.edu/~kriz/cifar.html under unspecified license. CINIC-10 (Darlow et al., 2018): ... The dataset is publicly available at https://github.com/BayesWatch/cinic-10 under MIT license. CIFAR-10.1 (Recht et al., 2018): ... The dataset is publicly available at https://github.com/modestyachts/CIFAR-10.1 under MIT license. CIFAR-10.2 (Lu et al., 2020): ... The dataset is publicly available at https://github.com/modestyachts/cifar-10.2 under unspecified license. STL (Coates et al., 2011): ... The dataset is publicly available at https://cs.stanford.edu/~acoates/stl10/ under unspecified license. SVHN (Netzer et al., 2011): ... The dataset is publicly available at http://ufldl.stanford.edu/housenumbers/ under unspecified license. ARC-Challenge (ARC-C; Clark et al., 2018): ... publicly available at https://huggingface.co/datasets/allenai/ai2_arc under CC-BY-SA-4.0 license. ARC-Easy (ARC-E; Clark et al., 2018): ... publicly available at https://huggingface.co/datasets/allenai/ai2_arc under CC-BY-SA-4.0 license. Open Book QA (OBQA; Mihaylov et al., 2018): ... publicly available at https://huggingface.co/datasets/allenai/openbookqa under unspecified license.
Dataset Splits | Yes | CIFAR-10/100 (Krizhevsky and Hinton, 2009): It consists of 40,960 training images, 9,040 validation images, and 10,000 test images... As there is no officially predefined validation split, we manually partitioned the 50,000 training examples into 40,960 for training and 9,040 for validation. ARC-Challenge (ARC-C; Clark et al., 2018): It consists of 1,117 questions for training and 295 questions for evaluation. ARC-Easy (ARC-E; Clark et al., 2018): It consists of 2,241 questions for training and 567 questions for evaluation. Open Book QA (OBQA; Mihaylov et al., 2018): It consists of 4,957 questions for training and 500 questions for evaluation.
Hardware Specification | Yes | On an RTX A6000, a ResNet processes a single input (batch size of 1) in 1.404 ms, while batch sizes of [32, 64, 128, 256] take [3.806, 6.993, 13.05, 24.68] ms, exhibiting a near-linear increase in our CIFAR-100 setup. This material is based upon work supported by the Google Cloud Research Credits program with the award GCP19980904 and Cloud TPUs from Google's TPU Research Cloud (TRC).
Software Dependencies | No | The paper describes various models, optimizers, and architectures used (e.g., ResNet, LLaMA-2-7B, SGD, Adam), but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed for replication.
Experiment Setup | Yes | Multi-SWAG teacher: Each SWA model was pretrained for 800 epochs, followed by 200 epochs of SWA training with frequency 1. Using the SGD optimizer with momentum 0.9, a cosine decay schedule was applied during pretraining, decaying the learning rate from 0.1 to 0.01; the learning rate was then held constant at 0.01 during SWA training. The batch size is 256. EDFM: We used the denoising MLP architecture with four blocks and a hidden dimension of 256 for CIFAR-10 and 512 for CIFAR-100. Evaluation was conducted with seven NFEs, and training used the SGD optimizer with a batch size of 256, momentum of 0.9, weight decay of 5e-04, learning rates of 1e-04 for CIFAR-10 and 3e-04 for CIFAR-100, and a cosine decay learning rate schedule over 1,000 epochs. KD (commonsense reasoning): Fine-tuning was carried out using the Adam optimizer with a batch size of 4, a maximum sequence length of 320, a learning rate of 3e-05, and a linear decay learning rate schedule over 50,000 steps.
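Algorithms 1 and 2 are only named in the excerpt above. As a rough, non-authoritative illustration of the standard conditional flow-matching recipe such algorithms typically follow, here is a toy sketch: the dimensions, the Gaussian "teacher" distribution around `mu`, and the closed-form linear velocity model are all assumptions made for illustration, not the paper's trained denoising MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                    # toy "logit" dimension (CIFAR-10 would use 10)
mu = np.array([2.0, -1.0, 0.5, 3.0])       # stand-in for a teacher-ensemble logit mode

# --- Training data for the flow-matching regression (cf. Algorithm 1) ---
N = 20_000
x0 = rng.standard_normal((N, dim))             # base Gaussian noise
x1 = mu + 0.1 * rng.standard_normal((N, dim))  # samples from the "teacher" distribution
t = rng.uniform(size=(N, 1))                   # random time in [0, 1]
xt = (1 - t) * x0 + t * x1                     # linear interpolation path
v_star = x1 - x0                               # conditional target velocity

# Linear stand-in for a denoising network: v(x, t) = [x, t, 1] @ W,
# fit in closed form instead of SGD to keep the sketch deterministic.
feats = np.hstack([xt, t, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(feats, v_star, rcond=None)

# --- Sampling by Euler integration of the learned ODE (cf. Algorithm 2) ---
def sample(n, nfe=7):
    x = rng.standard_normal((n, dim))
    for i in range(nfe):
        tt = i / nfe
        f = np.hstack([x, np.full((n, 1), tt), np.ones((n, 1))])
        x = x + (f @ W) / nfe                  # one Euler step of size 1/nfe
    return x

draws = sample(256)
```

With seven Euler steps (matching the seven NFEs quoted in the setup), the sample mean lands near the teacher mode `mu`; the paper's actual sampler integrates a trained MLP over ensemble outputs rather than this linear surrogate.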
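The cosine decay schedules quoted above (e.g., 0.1 down to 0.01 for teacher pretraining) can be written as a small helper. The function name and signature are assumptions, since the excerpt only gives the endpoint learning rates:

```python
import math

def cosine_decay_lr(step, total_steps, lr_max=0.1, lr_min=0.01):
    """Cosine decay from lr_max to lr_min over total_steps.

    Mirrors the quoted pretraining schedule (0.1 -> 0.01); past total_steps
    the rate is simply held at lr_min, as during the SWA phase.
    """
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_decay_lr(0, 1000)` gives the initial rate 0.1, `cosine_decay_lr(500, 1000)` gives the midpoint 0.055, and any step at or beyond 1000 returns the floor 0.01.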