Prune ’n Predict: Optimizing LLM Decision-making with Conformal Prediction
Authors: Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolo Dalmasso, Natraj Raman, Sumitra Ganesh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with multiple LLMs show that CROQ improves accuracy over the standard inference, with more pronounced gains when paired with CP-OPT. |
| Researcher Affiliation | Collaboration | 1 JPMorgan Chase AI Research, New York, NY, USA; 2 Department of Computer Science, University of Wisconsin, Madison, WI 53706, USA. This work was performed while at JPMorgan Chase. Correspondence to: Harit Vishwakarma <EMAIL>. |
| Pseudocode | No | The paper describes the steps for CROQ and CP-OPT methodology in prose, for example, "The procedure involves prompting the LLM with the reduced answer options from a conformal prediction set. The steps are illustrated with an example in Figure 2." It does not contain a formal pseudocode or algorithm block. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We conduct experiments on benchmark MCQ and tool usage tasks... Datasets. We evaluate our hypotheses on 3 datasets: MMLU (Hendrycks et al., 2021), TruthfulQA (Lin et al., 2022), and ToolAlpaca (Tang et al., 2023). MMLU and TruthfulQA are popular benchmark datasets for multiple-choice questions. |
| Dataset Splits | Yes | The standard dataset has very few training points, so we randomly draw 30% and 10% of the points from the test split and include them in the training set and validation set respectively. Note that we remove these points from the test set. The resulting splits have 4.5k, 2.9k, and 8.4k points in the train, validation, and test splits. ... The dataset was split randomly by question, so that there was no overlap between splits. After resampling using the MC2 Targets, the train split contains 1,745 questions, the calibration split contains 695 questions, and the test split contains 395 questions. ... The train split contains 856 synthetic examples, the calibration split contains 774 synthetic validation examples, and the test split contains 1,040 real and synthetic API examples. Splits are created to ensure no overlap in APIs occurs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs like Llama-3-8B-Instruct, Phi-3-mini-4k-instruct, and gemma-2-9b-it-SimPO, but it does not specify versions of any ancillary software components like programming languages, frameworks (e.g., PyTorch, TensorFlow), or libraries with their version numbers. |
| Experiment Setup | Yes | The hyperparameter settings we used for CP-OPT are given in Appendix F. ... The hyperparameters used to learn the score function using SGD are provided in Table 21 in Appendix F. |
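The CROQ step the paper describes in prose (prune answer options to a conformal prediction set, then re-prompt the LLM with only those options) can be sketched with standard split conformal prediction. This is an illustrative sketch under assumptions: it uses a plain softmax-score nonconformity measure rather than the paper's learned CP-OPT score function, and the function names here are invented for the example.

```python
import numpy as np

def conformal_threshold(cal_scores, cal_labels, alpha=0.1):
    """Split-conformal threshold q_hat from a held-out calibration set.

    cal_scores: (n, k) array of per-option scores (e.g., softmax probabilities).
    cal_labels: (n,) array of true option indices.
    alpha: target miscoverage rate, so prediction sets cover the true
           answer with probability at least 1 - alpha (marginally).
    """
    n = len(cal_labels)
    # Nonconformity of each calibration point: 1 minus the score the
    # model assigned to the true option.
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level for split conformal.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(nonconf, level, method="higher")

def prediction_set(scores, q_hat):
    """Keep every option whose nonconformity (1 - score) is within q_hat.

    In CROQ, only these surviving options would be shown to the LLM in a
    second, reduced-choice prompt.
    """
    return [i for i, s in enumerate(scores) if 1.0 - s <= q_hat]
```

On a new question, `prediction_set` typically returns fewer than the original k options, so the re-prompted LLM chooses among a pruned answer list while the set still covers the correct answer at the target rate.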