Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Calibrated Language Models and How to Find Them with Label Smoothing
Authors: Jerry Huang, Peng Lu, Qiuhao Zeng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we examine various open-sourced LLMs, where we identify significant calibration degradation after instruction tuning. ... We conduct SFT training with and without LS on a Tulu3 dataset (Wang et al., 2023) for different pre-trained language model families, including Llama (Grattafiori et al., 2024), Gemma (Gemma Team, 2024) and Mistral (Jiang et al., 2023). ... Table 1 provides a comprehensive comparison of the accuracy and calibration performance of various large language models (LLMs) with and without label smoothing (LS) across different supervised fine-tuning (SFT) datasets. |
| Researcher Affiliation | Academia | 1Université de Montréal 2Mila Quebec AI Institute 3University of Western Ontario. Corresponding author: Peng Lu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Memory-efficient forward pass |
| Open Source Code | No | The paper does not provide an explicit statement from the authors releasing their code or a link to their repository for the custom kernel they developed. While it mentions using the 'open-instruct repository at commit e363290 for our training setup', this is a third-party tool they utilized, not their own implementation for the core methodology described. |
| Open Datasets | Yes | We conduct SFT training with and without LS on a Tulu3 dataset (Wang et al., 2023) for different pre-trained language model families, including Llama (Grattafiori et al., 2024), Gemma (Gemma Team, 2024) and Mistral (Jiang et al., 2023). ... The evaluation is conducted on three widely used benchmark datasets: MMLU, HellaSwag, and ARC-Easy. |
| Dataset Splits | Yes | All models are evaluated in a 5-shot setting. ... The evaluation is conducted on three widely used benchmark datasets: MMLU, HellaSwag, and ARC-Easy, ensuring a robust assessment of model performance. ... We follow MMLU and use the following prompt for all tasks: The following are multiple choice questions (with answers) about {}.\n\n .format(query). |
| Hardware Specification | Yes | Experiments are conducted on an H100-SXM5 GPU with 80GB of RAM, PyTorch 2.4.0 and CUDA 12.1. |
| Software Dependencies | Yes | Experiments are conducted using PyTorch 2.4.0 and CUDA 12.1. ... We used the open-instruct repository at commit e363290 for our training setup. |
| Experiment Setup | Yes | We employ the AdamW optimizer for training and conduct a grid search over the learning rates {5e-6, 2e-5, 5e-5, 2e-4} to determine the optimal setting for each model. To facilitate stable training and prevent over-fitting, we use a batch size of 128 and apply a dropout rate of 0.1. ... We further tested label smoothing hyper-parameters [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], where 0.0 is no smoothing. |
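For context on the label smoothing (LS) hyper-parameter swept above: with smoothing ε, the one-hot training target is mixed toward a uniform distribution before computing cross-entropy, and ε = 0.0 recovers the standard loss. The sketch below is illustrative only, not the paper's implementation; it uses the convention that spreads ε over the K − 1 non-target classes (PyTorch's built-in `label_smoothing` argument to `CrossEntropyLoss` instead mixes uniformly over all K classes, a slightly different but equivalent-in-spirit convention).

```python
import math

def smoothed_cross_entropy(logits, target, epsilon):
    """Cross-entropy against a label-smoothed target distribution.

    With smoothing epsilon and K classes, the target places
    1 - epsilon on the true class and epsilon / (K - 1) on each
    of the remaining classes.
    """
    K = len(logits)
    # Numerically stable log-softmax: subtract the max logit first.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    loss = 0.0
    for k, lp in enumerate(log_probs):
        q = 1.0 - epsilon if k == target else epsilon / (K - 1)
        loss -= q * lp
    return loss

logits = [2.0, 0.5, -1.0]
# epsilon = 0.0 recovers the ordinary cross-entropy loss.
print(round(smoothed_cross_entropy(logits, 0, 0.0), 4))  # 0.2413
# epsilon = 0.1, the smallest non-zero value in the paper's sweep.
print(round(smoothed_cross_entropy(logits, 0, 0.1), 4))  # 0.4663
```

Smoothing raises the loss whenever the model is highly confident in the true class, which is the mechanism by which it discourages the over-confidence that degrades calibration after instruction tuning.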