Do LLMs estimate uncertainty well in instruction-following?

Authors: Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks... Our findings show that existing uncertainty methods struggle... To evaluate how well existing uncertainty estimation methods and models perform on instruction-following tasks, we evaluate six uncertainty estimation methods across four LLMs on the IFEval benchmark dataset (Zhou et al., 2023).
Researcher Affiliation | Collaboration | Juyeon Heo¹, Miao Xiong³, Christina Heinze-Deml², Jaya Narain²; ¹University of Cambridge, ²Apple, ³National University of Singapore
Pseudocode | No | The paper describes various uncertainty estimation methods (Verbalized confidence, Normalized p(true), Perplexity, Sequence probability, Mean token entropy, Probing) in descriptive text, but does not present any of them as structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following
Open Datasets | Yes | We evaluate uncertainty estimation with the IFEval dataset (Zhou et al., 2023)... Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following
Dataset Splits | Yes | We train a linear model on representations on instruction-following success labels... The model is trained for 1000 epochs on 70% training set and is evaluated on 30% test set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions using AdamW for optimization but does not provide specific version numbers for software dependencies such as machine learning frameworks (e.g., PyTorch, TensorFlow) or other libraries used in the experimental setup.
Experiment Setup | Yes | We trained a linear model as an uncertainty estimation function... optimized with AdamW, a 0.001 learning rate, 0.1 weight decay. The model is trained for 1000 epochs on 70% training set and is evaluated on 30% test set.
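
The probing setup quoted in the Experiment Setup entry (a linear model, AdamW, 0.001 learning rate, 0.1 weight decay, 1000 epochs, 70%/30% train/test split) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a logistic loss, full-batch updates, weight decay applied to every parameter, and AUROC as the evaluation metric, none of which are stated in the quoted text. The synthetic 2-D features stand in for LLM representations, and all function names are hypothetical.

```python
import math
import random


def predict(params, x):
    """Sigmoid of the linear score; params holds the weights followed by a bias."""
    z = sum(w * xi for w, xi in zip(params, x)) + params[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, epochs=1000, lr=0.001,
                       weight_decay=0.1, betas=(0.9, 0.999), eps=1e-8):
    """Fit a logistic-regression probe with full-batch AdamW (decoupled decay)."""
    dim = len(features[0])
    n = len(features)
    params = [0.0] * (dim + 1)
    m = [0.0] * (dim + 1)  # first-moment estimates
    v = [0.0] * (dim + 1)  # second-moment estimates
    for step in range(1, epochs + 1):
        # Gradient of the mean logistic loss over the whole training set.
        grad = [0.0] * (dim + 1)
        for x, y in zip(features, labels):
            err = predict(params, x) - y
            for j in range(dim):
                grad[j] += err * x[j] / n
            grad[dim] += err / n
        for j in range(dim + 1):
            m[j] = betas[0] * m[j] + (1 - betas[0]) * grad[j]
            v[j] = betas[1] * v[j] + (1 - betas[1]) * grad[j] ** 2
            m_hat = m[j] / (1 - betas[0] ** step)
            v_hat = v[j] / (1 - betas[1] ** step)
            # Decoupled weight decay first, then the Adam step.
            params[j] -= lr * weight_decay * params[j]
            params[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in data: label 1 = instruction followed, 0 = not followed.
rng = random.Random(0)
labels_all = [i % 2 for i in range(200)]
features_all = [[rng.gauss(2.0 if y else -2.0, 1.0) for _ in range(2)]
                for y in labels_all]

# 70% / 30% train/test split after shuffling.
order = list(range(200))
rng.shuffle(order)
cut = int(0.7 * len(order))
train_x = [features_all[i] for i in order[:cut]]
train_y = [labels_all[i] for i in order[:cut]]
test_x = [features_all[i] for i in order[cut:]]
test_y = [labels_all[i] for i in order[cut:]]

probe = train_linear_probe(train_x, train_y)
test_auroc = auroc([predict(probe, x) for x in test_x], test_y)
print(f"train={len(train_x)} test={len(test_x)} AUROC={test_auroc:.3f}")
```

In a real replication, the synthetic features would be replaced by hidden-state vectors extracted from the LLM, and the labels by IFEval instruction-following success labels. The linked repository is the place to check for the authors' actual choices.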