Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Calibrated Language Models and How to Find Them with Label Smoothing

Authors: Jerry Huang, Peng Lu, Qiuhao Zeng

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we examine various open-sourced LLMs, where we identify significant calibration degradation after instruction tuning. ... We conduct SFT training with and without LS on a Tulu3 dataset (Wang et al., 2023) for different pre-trained language model families, including Llama (Grattafiori et al., 2024), Gemma (Gemma Team, 2024) and Mistral (Jiang et al., 2023). ... Table 1 provides a comprehensive comparison of the accuracy and calibration performance of various large language models (LLMs) with and without label smoothing (LS) across different supervised fine-tuning (SFT) datasets.
Researcher Affiliation Academia 1Université de Montréal 2Mila Quebec AI Institute 3University of Western Ontario. Corresponding author: Peng Lu <EMAIL>.
Pseudocode Yes Algorithm 1 Memory-efficient forward pass
Open Source Code No The paper does not provide an explicit statement from the authors releasing their code or a link to their repository for the custom kernel they developed. While it mentions using the 'open-instruct repository at commit e363290 for our training setup', this is a third-party tool they utilized, not their own implementation for the core methodology described.
Open Datasets Yes We conduct SFT training with and without LS on a Tulu3 dataset (Wang et al., 2023) for different pre-trained language model families, including Llama (Grattafiori et al., 2024), Gemma (Gemma Team, 2024) and Mistral (Jiang et al., 2023). ... The evaluation is conducted on three widely used benchmark datasets: MMLU, HellaSwag, and ARC-Easy.
Dataset Splits Yes All models are performed with a 5-shot evaluation. ... The evaluation is conducted on three widely used benchmark datasets: MMLU, HellaSwag, and ARC-Easy, ensuring a robust assessment of model performance. ... We follow MMLU and use the following prompt for all tasks: "The following are multiple choice questions (with answers) about {}.\n\n".format(query).
Hardware Specification Yes Experiments are conducted on an H100-SXM5 GPU with 80GB of memory, PyTorch 2.4.0 and CUDA 12.1.
Software Dependencies Yes Experiments are conducted using PyTorch 2.4.0 and CUDA 12.1. ... We used the open-instruct repository at commit e363290 for our training setup.
Experiment Setup Yes We employ the AdamW optimizer for training and conduct a grid search over the learning rates {5e-6, 2e-5, 5e-5, 2e-4} to determine the optimal setting for each model. To facilitate stable training and prevent overfitting, we use a batch size of 128 and apply a dropout rate of 0.1. ... We further tested label smoothing hyper-parameters [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], where 0.0 is no smoothing.
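The experiment setup sweeps a label-smoothing coefficient from 0.0 (no smoothing) to 0.5. As a minimal sketch of what that hyper-parameter does, the snippet below implements standard label-smoothed cross-entropy (one-hot target mixed with a uniform distribution over all classes); this is the textbook formulation, not the paper's own kernel, and the toy probabilities are illustrative only.

```python
import math

def label_smoothed_nll(log_probs, target, epsilon):
    """Cross-entropy against a smoothed target distribution.

    The one-hot target gets weight (1 - epsilon), and the remaining
    epsilon mass is spread uniformly over all K classes; epsilon = 0.0
    recovers the standard negative log-likelihood.
    """
    k = len(log_probs)
    smoothed = [epsilon / k] * k      # uniform share for every class
    smoothed[target] += 1.0 - epsilon  # bulk of the mass on the true label
    return -sum(q * lp for q, lp in zip(smoothed, log_probs))

# Toy 4-class example with predicted probabilities [0.7, 0.1, 0.1, 0.1].
log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]
print(round(label_smoothed_nll(log_probs, 0, 0.0), 4))  # 0.3567 == -log(0.7)
print(round(label_smoothed_nll(log_probs, 0, 0.1), 4))  # 0.5026
```

Sweeping epsilon over [0.0, 0.1, ..., 0.5] as in the table above amounts to re-running SFT with this loss at each value and comparing accuracy and calibration.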