Do LLMs estimate uncertainty well in instruction-following?
Authors: Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks... Our findings show that existing uncertainty methods struggle... To evaluate how well existing uncertainty estimation methods and models perform on instruction-following tasks, we evaluate six uncertainty estimation methods across four LLMs on the IFEval benchmark dataset (Zhou et al., 2023). |
| Researcher Affiliation | Collaboration | Juyeon Heo1, Miao Xiong3, Christina Heinze-Deml2, Jaya Narain2; 1University of Cambridge, 2Apple, 3National University of Singapore |
| Pseudocode | No | The paper describes various uncertainty estimation methods (Verbalized confidence, Normalized p(true), Perplexity, Sequence probability, Mean token entropy, Probing) in descriptive text, but does not present any of them as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following |
| Open Datasets | Yes | We evaluate uncertainty estimation with the IFEval dataset (Zhou et al., 2023)... Code and data are available at https://github.com/apple/ml-uncertainty-llms-instruction-following |
| Dataset Splits | Yes | We train a linear model on internal representations using instruction-following success labels... The model is trained for 1000 epochs on a 70% training set and is evaluated on a 30% test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using AdamW for optimization but does not provide specific version numbers for software dependencies such as machine learning frameworks (e.g., PyTorch, TensorFlow) or other libraries used in the experimental setup. |
| Experiment Setup | Yes | We trained a linear model as an uncertainty estimation function... optimized with AdamW, a 0.001 learning rate, 0.1 weight decay. The model is trained for 1000 epochs on a 70% training set and is evaluated on a 30% test set. |
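Several of the white-box uncertainty methods named in the table (sequence probability, perplexity, mean token entropy, normalized p(true)) are simple functions of the model's per-step logits. The sketch below illustrates one plausible way to compute them; it is not the paper's implementation, and the argmax-token simplification and function names are assumptions for illustration.

```python
import numpy as np

def token_uncertainty_scores(logits):
    """Compute logit-based uncertainty scores for one generated sequence.

    `logits` has shape (seq_len, vocab_size). For simplicity we treat the
    argmax token at each step as the generated token; a real pipeline
    would use the actually sampled token ids.
    """
    # Numerically stable log-softmax over the vocabulary at each step.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(logprobs)

    chosen = logprobs.max(axis=-1)       # log-prob of the argmax token
    return {
        "sequence_log_prob": chosen.sum(),            # log of sequence probability
        "perplexity": np.exp(-chosen.mean()),         # length-normalized
        "mean_token_entropy": (-(probs * logprobs).sum(axis=-1)).mean(),
    }

def normalized_p_true(logit_true, logit_false):
    """Normalized p(True): softmax restricted to the True/False token logits."""
    z = max(logit_true, logit_false)
    e_t, e_f = np.exp(logit_true - z), np.exp(logit_false - z)
    return e_t / (e_t + e_f)
```

Higher perplexity and entropy indicate more uncertainty, while higher sequence probability and p(True) indicate more confidence, so the scores must be sign-aligned before comparing methods.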
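The probing setup quoted above (a linear model on representations, AdamW with learning rate 0.001 and weight decay 0.1, 1000 epochs, 70/30 split) can be sketched end to end. Everything below is a minimal stand-in: the features and labels are synthetic (the paper extracts hidden-state representations from the LLM and uses binary instruction-following success labels on IFEval), and the hand-rolled AdamW update only mimics the stated hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: X plays the role of hidden-state representations,
# y the role of binary instruction-following success labels.
X = rng.normal(size=(500, 64))
w_true = rng.normal(size=64)
y = ((X @ w_true + 0.5 * rng.normal(size=500)) > 0).astype(float)

# 70% train / 30% test split, as stated in the paper.
n_train = int(0.7 * len(X))
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Linear probe trained with a minimal hand-rolled AdamW
# (lr=0.001, weight_decay=0.1, 1000 epochs, per the paper's setup).
w, b = np.zeros(64), 0.0
m_w, v_w = np.zeros(64), np.zeros(64)
m_b, v_b = 0.0, 0.0
lr, wd, b1, b2, eps = 1e-3, 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    p = sigmoid(X_tr @ w + b)
    g_w = X_tr.T @ (p - y_tr) / len(y_tr)   # gradient of the logistic loss
    g_b = (p - y_tr).mean()
    m_w, v_w = b1 * m_w + (1 - b1) * g_w, b2 * v_w + (1 - b2) * g_w**2
    m_b, v_b = b1 * m_b + (1 - b1) * g_b, b2 * v_b + (1 - b2) * g_b**2
    mh_w, vh_w = m_w / (1 - b1**t), v_w / (1 - b2**t)
    mh_b, vh_b = m_b / (1 - b1**t), v_b / (1 - b2**t)
    # Decoupled weight decay is applied to the weights only, not the bias.
    w -= lr * (mh_w / (np.sqrt(vh_w) + eps) + wd * w)
    b -= lr * mh_b / (np.sqrt(vh_b) + eps)

# The probe's predicted success probability doubles as the confidence score.
test_acc = ((sigmoid(X_te @ w + b) > 0.5) == y_te).mean()
```

In practice one would use `torch.optim.AdamW` rather than a hand-written update; the point here is only to make the quoted split and hyperparameters concrete.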