Calibrating LLMs with Information-Theoretic Evidential Deep Learning

Authors: Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration." |
| Researcher Affiliation | Academia | Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei — Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML) |
| Pseudocode | Yes | Algorithm 1: IB-EDL training and inference pseudocode. |
| Open Source Code | Yes | Code is available at https://github.com/sandylaker/ib-edl. |
| Open Datasets | Yes | "We compare methods on six multiple-choice classification datasets, including five for commonsense reasoning, ARC-C and ARC-E (Clark et al., 2018), OpenBookQA (OBQA) (Mihaylov et al., 2018), CommonsenseQA (CSQA) (Talmor et al., 2019), and SciQ (Welbl et al., 2017), alongside a dataset for reading comprehension, RACE (Lai et al., 2017)." |
| Dataset Splits | No | The paper uses standard benchmark datasets but does not explicitly state the train/validation/test split percentages, sample counts, or a citation for the specific splits used in each experiment. It describes how datasets are used for in-distribution (ID) and out-of-distribution (OOD) settings (e.g., "fine-tune the LLMs on OBQA (as the ID dataset) and test them on ARC-C, ARC-E, and CSQA (as OOD dataset)") and mentions perturbing 30% of labels in the training set, but this does not fully specify the data partitioning needed for reproduction. |
| Hardware Specification | Yes | "All experiments are conducted on a single NVIDIA H100 GPU." |
| Software Dependencies | No | The paper mentions using the "PEFT" and "Transformers" libraries, and "PyTorch" for parallelization, but does not provide version numbers for any of these software components. |
| Experiment Setup | Yes | "We used Dropout with a dropout rate of p = 0.1, LoRA α = 16, rank r = 8, and set bias = 'lora_only'. All models were trained for 30,000 steps on the CSQA dataset and 10,080 steps on the other datasets. The learning rate was set to 0.00005 and annealed using a cosine schedule. The maximum token length was set to 300 for the RACE dataset and 256 for all other datasets. Training was conducted with bfloat16 precision. For MCD (Gal & Ghahramani, 2016), we performed 10 forward passes. For Ens (Lakshminarayanan et al., 2017; Fort et al., 2019), we used predictions from 3 models. For EDL methods, we follow previous works in applying gradient clipping with a maximum gradient norm of 20 to stabilize training. By default, we used K = 20 for sampling pre-evidences from the predicted Gaussian distribution during both training and inference. Table 5 lists the β values used in different experiments." |