High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

Authors: Muhammed Ildiz, Halil Gozeten, Ege Taga, Marco Mondelli, Samet Oymak

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our results on numerical experiments both on ridgeless regression and on neural network architectures. In Figure 2a, we examine the surrogate-to-target model in the context of image classification. Specifically, we fine-tune a pretrained ResNet-50 model (He et al., 2015) using both ground-truth labels and predictions from a surrogate (weak) model on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009).
Researcher Affiliation | Academia | M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga (University of Michigan, Ann Arbor); Marco Mondelli (Institute of Science and Technology Austria); Samet Oymak (University of Michigan, Ann Arbor)
Pseudocode | No | The paper primarily focuses on theoretical derivations and analysis. It does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | Specifically, we fine-tune a pretrained ResNet-50 model (He et al., 2015) using both ground-truth labels and predictions from a surrogate (weak) model on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | In the CIFAR-10 experiment, we initially trained the surrogate models on the training portion of the CIFAR-10 dataset. ... During testing, all models were evaluated using the test portion of the CIFAR-10 dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types) used for running the experiments.
Software Dependencies | No | We initialize the optimizer for our model using stochastic gradient descent (SGD) provided by the optim module of PyTorch. (No specific version numbers are given for PyTorch or other libraries.)
Experiment Setup | Yes | The optimizer is configured with the following parameters: learning rate set to 0.01, momentum to 0.9, and weight decay to 5 × 10⁻⁴. Additionally, we define a learning rate scheduler, specifically a cosine annealing scheduler, which adjusts the learning rate using a cosine function over 200 iterations, denoted T_max. We use a batch size of 32 and train all models over 60 epochs.
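The cosine annealing schedule quoted above follows the closed form η_t = η_min + (η_max − η_min)(1 + cos(πt/T_max))/2 used by PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR`. A minimal pure-Python sketch of that schedule, assuming the default eta_min = 0 and the paper's stated base learning rate and T_max:

```python
import math

def cosine_annealing_lr(step: int, base_lr: float = 0.01,
                        t_max: int = 200, eta_min: float = 0.0) -> float:
    """Learning rate at `step` under cosine annealing (closed form,
    matching torch.optim.lr_scheduler.CosineAnnealingLR with eta_min=0)."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

# The rate starts at base_lr, halves at t_max/2, and decays to eta_min at t_max.
print(cosine_annealing_lr(0))    # 0.01
print(cosine_annealing_lr(100))  # ~0.005
```

In the paper's setup this schedule would be paired with `torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=5e-4)` and `CosineAnnealingLR(optimizer, T_max=200)`; the sketch above only illustrates the learning-rate curve itself.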