Calibrating LLMs with Information-Theoretic Evidential Deep Learning

Authors: Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration." |
| Researcher Affiliation | Academia | Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei — Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML) |
| Pseudocode | Yes | Algorithm 1: IB-EDL training and inference pseudocode. |
| Open Source Code | Yes | Code is available at https://github.com/sandylaker/ib-edl. |
| Open Datasets | Yes | "We compare methods on six multiple-choice classification datasets, including five for commonsense reasoning, ARC-C and ARC-E (Clark et al., 2018), OpenBookQA (OBQA) (Mihaylov et al., 2018), CommonsenseQA (CSQA) (Talmor et al., 2019), and SciQ (Welbl et al., 2017), alongside a dataset for reading comprehension, RACE (Lai et al., 2017)." |
| Dataset Splits | No | The paper uses standard benchmark datasets but does not explicitly state the train/validation/test split percentages, sample counts, or a citation for the specific splits used in each experiment. It describes how datasets are used for in-distribution (ID) and out-of-distribution (OOD) settings (e.g., "fine-tune the LLMs on OBQA (as the ID dataset) and test them on ARC-C, ARC-E, and CSQA (as OOD dataset)") and mentions perturbing 30% of labels in the training set, but this does not fully specify the data partitioning needed for reproduction. |
| Hardware Specification | Yes | "All experiments are conducted on a single NVIDIA H100 GPU." |
| Software Dependencies | No | The paper mentions using the "PEFT" and "Transformers" libraries, and "PyTorch" for parallelization, but does not provide version numbers for any of these software components. |
| Experiment Setup | Yes | "We used Dropout with a dropout rate of p = 0.1, LoRA α = 16, rank r = 8, and set bias = 'lora_only'. All models were trained for 30,000 steps on the CSQA dataset and 10,080 steps on the other datasets. The learning rate was set to 0.00005 and annealed using a cosine schedule. The maximum token length was set to 300 for the RACE dataset and 256 for all other datasets. Training was conducted with bfloat16 precision. For MCD (Gal & Ghahramani, 2016), we performed 10 forward passes. For Ens (Lakshminarayanan et al., 2017; Fort et al., 2019), we used predictions from 3 models. For EDL methods, we follow previous works in applying gradient clipping with a maximum gradient norm of 20 to stabilize training. By default, we used K = 20 for sampling pre-evidences from the predicted Gaussian distribution during both training and inference. Table 5 lists the β values used in different experiments." |