Calibrating LLMs with Information-Theoretic Evidential Deep Learning
Authors: Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration. |
| Researcher Affiliation | Academia | Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei; Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML) |
| Pseudocode | Yes | Algorithm 1 IB-EDL training and inference pseudocode. |
| Open Source Code | Yes | Code is available at https://github.com/sandylaker/ib-edl. |
| Open Datasets | Yes | We compare methods on six multiple-choice classification datasets, including five for commonsense reasoning: ARC-C and ARC-E (Clark et al., 2018), OpenBookQA (OBQA) (Mihaylov et al., 2018), CommonsenseQA (CSQA) (Talmor et al., 2019), and SciQ (Welbl et al., 2017), alongside a dataset for reading comprehension, RACE (Lai et al., 2017). |
| Dataset Splits | No | The paper uses standard benchmark datasets but does not state the train/validation/test split percentages, sample counts, or a citation for the specific splits used in each experiment. It describes how datasets serve in-distribution (ID) and out-of-distribution (OOD) roles (e.g., "fine-tune the LLMs on OBQA (as the ID dataset) and test them on ARC-C, ARC-E, and CSQA (as OOD dataset)") and mentions perturbing 30% of labels in the training set, but this does not fully specify the data partitioning needed for reproduction. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA H100 GPU. |
| Software Dependencies | No | The paper mentions using the "PEFT" and "Transformers" libraries and "PyTorch" for parallelization, but it does not provide version numbers for any of these software components. |
| Experiment Setup | Yes | We used Dropout with a dropout rate of p = 0.1, LoRA α = 16, rank r = 8, and set bias = "lora_only". All models were trained for 30000 steps on the CSQA dataset and 10080 steps on the other datasets. The learning rate was set to 0.00005 and annealed using a cosine schedule. The maximum token length was set to 300 for the RACE dataset and 256 for all other datasets. Training was conducted with bfloat16 precision. For MCD (Gal & Ghahramani, 2016), we performed 10 forward passes. For Ens (Lakshminarayanan et al., 2017; Fort et al., 2019), we used predictions from 3 models. For EDL methods, we follow previous works and apply gradient clipping with a maximum gradient norm of 20 to stabilize training. By default, we used K = 20 for sampling pre-evidences from the predicted Gaussian distribution during both training and inference. Table 5 lists the β values used in different experiments. |
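The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch. The key names below mirror Hugging Face PEFT/Transformers argument names, but this is an illustrative reconstruction of the reported values, not the authors' actual configuration code.

```python
# Hypothetical reconstruction of the fine-tuning setup reported in the
# paper; key names follow PEFT/Transformers conventions, values are
# taken verbatim from the Experiment Setup row.
lora_config = {
    "r": 8,               # LoRA rank
    "lora_alpha": 16,     # LoRA scaling factor
    "lora_dropout": 0.1,  # dropout rate p = 0.1
    "bias": "lora_only",  # train only the LoRA bias terms
}

training_config = {
    "learning_rate": 5e-5,                        # annealed with cosine schedule
    "lr_scheduler_type": "cosine",
    "max_steps": {"CSQA": 30_000, "default": 10_080},
    "max_token_length": {"RACE": 300, "default": 256},
    "precision": "bfloat16",
    "max_grad_norm": 20,                          # gradient clipping for EDL methods
    "num_pre_evidence_samples": 20,               # K, for training and inference
}
```

A reproduction attempt would still need the per-experiment β values from Table 5 of the paper, which are not restated here.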