High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws
Authors: Muhammed Ildiz, Halil Gozeten, Ege Taga, Marco Mondelli, Samet Oymak
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our results via numerical experiments on both ridgeless regression and neural network architectures. In Figure 2a, we examine the surrogate-to-target model in the context of image classification. Specifically, we fine-tune a pretrained ResNet-50 model (He et al., 2015) using both ground-truth labels and predictions from a surrogate (weak) model on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). |
| Researcher Affiliation | Academia | M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga (University of Michigan, Ann Arbor, EMAIL); Marco Mondelli (Institute of Science and Technology Austria, EMAIL); Samet Oymak (University of Michigan, Ann Arbor, EMAIL) |
| Pseudocode | No | The paper primarily focuses on theoretical derivations and analysis. It does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | Specifically, we fine-tune a pretrained ResNet-50 model (He et al., 2015) using both ground-truth labels and predictions from a surrogate (weak) model on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). |
| Dataset Splits | Yes | In the CIFAR-10 experiment, we initially trained the surrogate models on the training portion of the CIFAR-10 dataset. ... During testing, all models were evaluated using the test portion of the CIFAR-10 dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types) used for running the experiments. |
| Software Dependencies | No | We initialize the optimizer for our model using stochastic gradient descent (SGD) provided by the optim module of PyTorch. (Lacks specific version numbers for PyTorch or other libraries.) |
| Experiment Setup | Yes | The optimizer is configured with the following parameters: learning rate set to 0.01, momentum to 0.9, and weight decay to 5×10⁻⁴. Additionally, we define a learning rate scheduler, specifically a cosine annealing scheduler, which adjusts the learning rate using a cosine function over 200 iterations (T_max = 200). We used a batch size of 32 and trained all models for 60 epochs. |