Do LLMs estimate uncertainty well in instruction-following?
Authors: Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks... Our findings show that existing uncertainty methods struggle... To evaluate how well existing uncertainty estimation methods and models perform on instruction-following tasks, we evaluate six uncertainty estimation methods across four LLMs on the IFEval benchmark dataset (Zhou et al., 2023). |
| Researcher Affiliation | Collaboration | Juyeon Heo1, Miao Xiong3, Christina Heinze-Deml2, Jaya Narain2; 1University of Cambridge, 2Apple, 3National University of Singapore |
| Pseudocode | No | The paper describes various uncertainty estimation methods (Verbalized confidence, Normalized p(true), Perplexity, Sequence probability, Mean token entropy, Probing) in descriptive text, but does not present any of them as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following |
| Open Datasets | Yes | We evaluate uncertainty estimation with the IFEval dataset (Zhou et al., 2023)... Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following |
| Dataset Splits | Yes | We train a linear model on internal representations using instruction-following success labels... The model is trained for 1000 epochs on a 70% training set and is evaluated on a 30% test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using AdamW for optimization but does not provide specific version numbers for software dependencies such as machine learning frameworks (e.g., PyTorch, TensorFlow) or other libraries used in the experimental setup. |
| Experiment Setup | Yes | We trained a linear model as an uncertainty estimation function... optimized with AdamW, a 0.001 learning rate, 0.1 weight decay. The model is trained for 1000 epochs on a 70% training set and is evaluated on a 30% test set. |
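Several of the white-box uncertainty methods named in the table (sequence probability, perplexity, mean token entropy, normalized p(true)) are simple functions of the model's per-step logits. The sketch below illustrates one plausible way to compute them; it is not the paper's implementation, and the argmax-token simplification and function names are assumptions for illustration.

```python
import numpy as np

def token_uncertainty_scores(logits):
    """Compute logit-based uncertainty scores for one generated sequence.

    `logits` has shape (seq_len, vocab_size). For simplicity we treat the
    argmax token at each step as the generated token; a real pipeline
    would use the actually sampled token ids.
    """
    # Numerically stable log-softmax over the vocabulary at each step.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(logprobs)

    chosen = logprobs.max(axis=-1)       # log-prob of the argmax token
    return {
        "sequence_log_prob": chosen.sum(),            # log of sequence probability
        "perplexity": np.exp(-chosen.mean()),         # length-normalized
        "mean_token_entropy": (-(probs * logprobs).sum(axis=-1)).mean(),
    }

def normalized_p_true(logit_true, logit_false):
    """Normalized p(True): softmax restricted to the True/False token logits."""
    z = max(logit_true, logit_false)
    e_t, e_f = np.exp(logit_true - z), np.exp(logit_false - z)
    return e_t / (e_t + e_f)
```

Higher perplexity and entropy indicate more uncertainty, while higher sequence probability and p(True) indicate more confidence, so the scores must be sign-aligned before comparing methods.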
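The probing setup quoted above (a linear model on representations, AdamW with learning rate 0.001 and weight decay 0.1, 1000 epochs, 70/30 split) can be sketched end to end. Everything below is a minimal stand-in: the features and labels are synthetic (the paper extracts hidden-state representations from the LLM and uses binary instruction-following success labels on IFEval), and the hand-rolled AdamW update only mimics the stated hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: X plays the role of hidden-state representations,
# y the role of binary instruction-following success labels.
X = rng.normal(size=(500, 64))
w_true = rng.normal(size=64)
y = ((X @ w_true + 0.5 * rng.normal(size=500)) > 0).astype(float)

# 70% train / 30% test split, as stated in the paper.
n_train = int(0.7 * len(X))
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Linear probe trained with a minimal hand-rolled AdamW
# (lr=0.001, weight_decay=0.1, 1000 epochs, per the paper's setup).
w, b = np.zeros(64), 0.0
m_w, v_w = np.zeros(64), np.zeros(64)
m_b, v_b = 0.0, 0.0
lr, wd, b1, b2, eps = 1e-3, 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    p = sigmoid(X_tr @ w + b)
    g_w = X_tr.T @ (p - y_tr) / len(y_tr)   # gradient of the logistic loss
    g_b = (p - y_tr).mean()
    m_w, v_w = b1 * m_w + (1 - b1) * g_w, b2 * v_w + (1 - b2) * g_w**2
    m_b, v_b = b1 * m_b + (1 - b1) * g_b, b2 * v_b + (1 - b2) * g_b**2
    mh_w, vh_w = m_w / (1 - b1**t), v_w / (1 - b2**t)
    mh_b, vh_b = m_b / (1 - b1**t), v_b / (1 - b2**t)
    # Decoupled weight decay is applied to the weights only, not the bias.
    w -= lr * (mh_w / (np.sqrt(vh_w) + eps) + wd * w)
    b -= lr * mh_b / (np.sqrt(vh_b) + eps)

# The probe's predicted success probability doubles as the confidence score.
test_acc = ((sigmoid(X_te @ w + b) > 0.5) == y_te).mean()
```

In practice one would use `torch.optim.AdamW` rather than a hand-written update; the point here is only to make the quoted split and hyperparameters concrete.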