Learning to Route LLMs with Confidence Tokens

Authors: Yu-Neng Chuang, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu, Helen Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Self-Reflection with Error-based Feedback (Self-REF), a lightweight training strategy that teaches LLMs to reliably express confidence in whether their answers are correct. ... We conduct experiments to evaluate the performance of Self-REF, centered around the following three research questions: RQ1: Compared with state-of-the-art baselines, how does Self-REF perform on confidence-based routing? RQ2: How reliable are confidence scores from Self-REF for the rejection learning task? RQ3: How well aligned are the confidence-token-based scores of Self-REF with the actual probabilities of correctness? ... Confidence-based routing using Self-REF consistently achieves the best accuracy vs. routing-rate trade-off across all four datasets (MMLU, Openbook QA, GSM8K, Med QA) and both local LLMs (Llama3-8B-Instruct and Mistral-7B-Instruct) (Figure 2).
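The confidence-based routing in RQ1 can be sketched in a few lines: answer with the local LLM, and defer to a stronger model whenever the confidence read off the <CN>/<UN> token falls below a threshold. The function and model names below are illustrative assumptions (the paper does not release code), not the authors' implementation.

```python
def route(queries, small_model, large_model, threshold=0.5):
    """Answer with the local LLM; defer to the large model when the
    confidence derived from the <CN>/<UN> token falls below `threshold`."""
    answers, routed = [], 0
    for q in queries:
        answer, confidence = small_model(q)  # hypothetical (answer, confidence) API
        if confidence < threshold:
            answer = large_model(q)          # route the query to the stronger model
            routed += 1
        answers.append(answer)
    routing_rate = routed / len(queries)     # fraction of queries sent upstream
    return answers, routing_rate
```

Sweeping `threshold` over [0, 1] traces out an accuracy vs. routing-rate curve of the kind compared in Figure 2.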
Researcher Affiliation | Collaboration | *Work done during an internship at Apple. 1Rice University, Houston, TX, USA; 2Apple, Inc., Cupertino, CA, USA. Correspondence to: Yu-Neng Chuang <EMAIL>, Helen Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1: Data Augmentation for Self-REF
Input: base LLM M(·) and original training data D_train = {(x^(i), y^(i))}_{i=1}^{N_train}.
Output: augmented data D'_train with unconfident samples and confident samples.
Use M(·) to make predictions ŷ^(i) for each input query x^(i).
1: Create a set of unconfident samples: D'_train,<UN> = {(x^(i), ŷ^(i)<UN>) : y^(i) ≠ ŷ^(i)}.
2: Create a set of confident samples: D'_train,<CN> = {(x^(i), ŷ^(i)<CN>) : y^(i) = ŷ^(i)}.
3: Mix the unconfident and confident data to form the combined augmented dataset, subsampling the unconfident samples with a tunable proportion α ∈ [0, 1]: D'_train = subsample_α(D'_train,<UN>) ∪ D'_train,<CN>.
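A minimal Python sketch of Algorithm 1, assuming answers are strings and the confidence token is appended as text; the function and variable names are illustrative, not from the paper's (unreleased) code.

```python
import random

UN_TOKEN, CN_TOKEN = "<UN>", "<CN>"

def augment(train, base_model, alpha, seed=2024):
    """Label each training pair confident or unconfident by whether the base
    model's prediction matches the ground truth, append the matching
    confidence token, then subsample the unconfident set by alpha."""
    unconfident, confident = [], []
    for x, y in train:
        y_hat = base_model(x)
        if y_hat == y:                        # correct -> confident sample
            confident.append((x, y_hat + CN_TOKEN))
        else:                                 # incorrect -> unconfident sample
            unconfident.append((x, y_hat + UN_TOKEN))
    rng = random.Random(seed)
    k = int(alpha * len(unconfident))         # alpha in [0, 1]
    return rng.sample(unconfident, k) + confident
```

With alpha = 1.0 all unconfident samples are kept; smaller alpha down-weights them, which is the ratio tuned on the validation set in the experiment setup.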
Open Source Code | No | The paper does not contain an explicit statement or link indicating the release of source code for the described methodology.
Open Datasets | Yes | All experiments are conducted on the following four public datasets (more details in Appendix B): MMLU (Hendrycks et al., 2021a;b): ... Openbook QA (Mihaylov et al., 2018): ... GSM8K (Cobbe et al., 2021): ... Med QA (Jin et al., 2021): ...
Dataset Splits | Yes | Let D = {(x^(i), y^(i))}_{i=1}^{N} denote a dataset, where the x^(i) are queries and the y^(i) are ground-truth answers. Let D_train denote the training split, D_val the validation split, and D_test the test split, where D = D_train ∪ D_val ∪ D_test. ... In the experiments, we evaluate the model on the validation set [for MMLU]. ... We assess the model on the 1,331-sample test set [for GSM8K]. ... To study this question [rejection learning], we create an evaluation set where half of the samples do not contain the ground truth, i.e., we remove all ground-truth information from x^(i) and replace the label with "none of the above", y^(i) = ∅.
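The rejection-learning evaluation set described above can be sketched as follows; `strip_ground_truth` is a hypothetical helper standing in for whatever removes ground-truth-bearing context from a query, and the exact selection procedure is an assumption beyond "half of the samples".

```python
import random

def make_rejection_set(samples, strip_ground_truth, seed=2024):
    """For a random half of the (query, answer) pairs, remove the
    ground-truth information from the query and relabel the answer
    as 'none of the above'."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    stripped = set(order[: len(samples) // 2])  # exactly half the indices
    out = []
    for i, (x, y) in enumerate(samples):
        if i in stripped:
            out.append((strip_ground_truth(x), "none of the above"))
        else:
            out.append((x, y))
    return out
```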
Hardware Specification | Yes | Computing infrastructure: GPU. GPU model: Nvidia A100; GPU count: 8; GPU memory: 80 GB.
Software Dependencies | No | The paper does not provide specific software names with version numbers for libraries or solvers (e.g., PyTorch version, Python version, CUDA version) beyond the LLM models used.
Experiment Setup | Yes | Our experiments are conducted under Self-REF. We focus on the confidence-based routing and rejection learning tasks, following the pipeline of Base Model Training and Confidence Token Inference. The details of each step are as follows. Base Model Training: In this work, Self-REF is fine-tuned on local LLMs (i.e., Llama3-8B-Instruct and Mistral-7B-Instruct) with the following hyperparameters. We apply LoRA adapters to every query, key, and value (Q-K-V) layer, the token embedding layers, and the final linear layers of the local LLMs, with a batch size of 4 and a learning rate of 1e-4. Fixing the overall dataset size, the parameter α is tuned based on performance on the validation set, selecting from unconfident-to-confident ratios of 1:1, 1:2, 1:3, 1:4, and 1:5. ... Self-REF Inference: At inference time, the temperature is set to 0, top-p to 1.0, and all decoding is greedy search. All other sampling strategies are disabled, with a fixed random seed of 2024 for reproducibility. ... Table 5: Hyper-parameter and model structure settings in Self-REF.
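The reported settings can be collected into configuration objects; the dictionary keys and the module names under `target_modules` are illustrative assumptions (common names for Q-K-V projections, token embeddings, and the output head in Llama/Mistral checkpoints), not released code.

```python
# Hypothetical assembly of the fine-tuning and decoding settings reported above.
FINETUNE = {
    "adapter": "LoRA",
    # Q-K-V projection layers, token embeddings, and the final linear layer
    "target_modules": ["q_proj", "k_proj", "v_proj", "embed_tokens", "lm_head"],
    "batch_size": 4,
    "learning_rate": 1e-4,
}
DECODING = {"temperature": 0.0, "top_p": 1.0, "do_sample": False, "seed": 2024}

def select_alpha(val_accuracy):
    """Pick the unconfident:confident ratio with the best validation accuracy,
    mirroring how alpha is tuned on the validation set."""
    return max(val_accuracy, key=val_accuracy.get)
```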