LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration
Authors: Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations confirm LumiNet's effectiveness, demonstrating substantial accuracy improvements (for instance, boosting ResNet8×4 performance on CIFAR-100 from 73.3% to 77.5%) while ensuring practical efficiency and applicability for real-world scenarios. From Section 4, Experiments: Dataset: Using benchmark datasets, we conducted experiments on three vision tasks: image classification, object detection, and transfer learning. Our experiments leveraged four widely acknowledged benchmark datasets. First, CIFAR-100 (Krizhevsky et al., 2009), encapsulating a compact yet comprehensive representation of images, comprises 60,000 32×32-resolution images, segregated into 100 classes with 600 images per class. ImageNet (Russakovsky et al., 2015), a more extensive dataset, provides a rigorous testing ground with its collection of over a million images distributed across 1,000 diverse classes, often utilized to probe models for robustness and generalization. Concurrently, the MS COCO dataset (Lin et al., 2014), renowned for its rich annotations, is pivotal for intricate tasks, facilitating both object detection and segmentation assessments with 330K images, 1.5 million object instances, and 80 object categories. |
| Researcher Affiliation | Collaboration | Md. Ismail Hossain (Apurba-NSU R&D Lab, North South University, Bangladesh); Sameera Ramasinghe (Pluralis Research); Fuad Rahman (Apurba Technologies, Sunnyvale, CA 94085, USA) |
| Pseudocode | No | The paper describes the methodology with equations and textual descriptions, for example, under Section 3.2 "Introducing LumiNet" and "Constructing the perception", but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not contain an explicit statement from the authors about releasing their own source code, nor does it provide a direct link to a code repository for the methodology described. It only refers to implementation settings from other papers and external resources for datasets. |
| Open Datasets | Yes | Dataset: Using benchmark datasets, we conducted experiments on three vision tasks: image classification, object detection, and transfer learning. Our experiments leveraged four widely acknowledged benchmark datasets. First, CIFAR-100 (Krizhevsky et al., 2009), ... ImageNet (Russakovsky et al., 2015), ... MS COCO dataset (Lin et al., 2014), ... The Tiny ImageNet dataset, ... Beyond vision applications, we also adapted this method to the GLUE Benchmark and small language models. We used the dataset split within this space for our experiment. We have used 13.5k samples from the Dolly dataset for fine-tuning, while 500 samples were reserved for testing. Additionally, 80 and 240 samples were used from Vicuna and Self-Inst, respectively, for evaluation. |
| Dataset Splits | Yes | We strictly adhered to standard dataset splits for reproducibility and benchmarking compatibility across training, validation, and testing. We have used 13.5k samples from the Dolly dataset for fine-tuning, while 500 samples were reserved for testing. Additionally, 80 and 240 samples were used from Vicuna and Self-Inst, respectively, for evaluation. |
| Hardware Specification | Yes | All models are trained on a single GPU; the training was performed on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for core software libraries like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | For training a student model on the CIFAR-100 dataset, we use a batch size of 64 and train for a total of 240 epochs. The initial learning rate (LR) is set to 0.05, with learning rate decay applied at epochs 150, 180, and 210, where the LR is reduced by a factor of 0.1 each time. We employ a weight decay of 0.0005 and a momentum of 0.9 in our stochastic gradient descent (SGD) optimizer. When training on the ImageNet dataset, we use a batch size of 512 and train for a total of 100 epochs. The initial LR is set to 0.2, with learning rate decay scheduled at epochs 30, 60, and 90, where the LR is decreased by a factor of 0.1 each time. We apply a weight decay of 0.0001 and utilize a momentum of 0.9 in the SGD optimizer. For training object detection student models on the MS-COCO 2017 dataset, we use a batch of 8 images. The base learning rate is set to 0.01, and the maximum number of iterations is set to 180,000. Learning rate decay is applied at specific steps during training, with decay steps set at 120,000 and 160,000 iterations. For Vision Transformer, ... We set the dropout rate to 0.0, the drop path rate to 0.1, and the attention dropout rate to 0.0. For optimization, we use the AdamW optimizer with a base learning rate of 5.0e-4 and a minimum learning rate of 5.0e-6. The learning rate policy is cosine annealing (cos) with a maximum of 300 epochs. We apply a weight decay of 0.05, a warm-up factor of 0.001, and warm-up epochs of 20. For LLMs, ... The training was conducted with a batch size of 2. We implemented sequence-level tokenization and used the AdamW optimizer with a learning rate of 5e-5. |
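The step-decay LR schedule reported for the CIFAR-100 recipe (initial LR 0.05, decayed by 0.1 at epochs 150, 180, and 210 over 240 epochs) can be sketched as a small dependency-free helper. This is an illustrative reconstruction of the stated schedule, not the authors' code; the function name and defaults are assumptions.

```python
def lr_at_epoch(epoch, base_lr=0.05, milestones=(150, 180, 210), gamma=0.1):
    """Return the learning rate in effect at `epoch` under step decay.

    Mirrors the CIFAR-100 recipe described in the paper: the LR is
    multiplied by `gamma` (0.1) at each milestone epoch.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Spot-check the schedule across the 240-epoch run.
for e in (0, 149, 150, 180, 210, 239):
    print(e, lr_at_epoch(e))
```

The same shape covers the ImageNet setting by passing `base_lr=0.2` and `milestones=(30, 60, 90)`; in a PyTorch training loop this corresponds to `torch.optim.lr_scheduler.MultiStepLR` over an SGD optimizer with the stated momentum and weight decay.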