Prune ’n Predict: Optimizing LLM Decision-making with Conformal Prediction
Authors: Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolo Dalmasso, Natraj Raman, Sumitra Ganesh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with multiple LLMs show that CROQ improves accuracy over the standard inference, with more pronounced gains when paired with CP-OPT. |
| Researcher Affiliation | Collaboration | 1 JPMorgan Chase AI Research, New York, NY, USA; 2 Department of Computer Science, University of Wisconsin, Madison, WI 53706, USA. This work was performed while at JPMorgan Chase. Correspondence to: Harit Vishwakarma <EMAIL>. |
| Pseudocode | No | The paper describes the steps for CROQ and CP-OPT methodology in prose, for example, "The procedure involves prompting the LLM with the reduced answer options from a conformal prediction set. The steps are illustrated with an example in Figure 2." It does not contain a formal pseudocode or algorithm block. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We conduct experiments on benchmark MCQ and tool usage tasks... Datasets. We evaluate our hypotheses on 3 datasets: MMLU (Hendrycks et al., 2021), TruthfulQA (Lin et al., 2022), and ToolAlpaca (Tang et al., 2023). MMLU and TruthfulQA are popular benchmark datasets for multiple-choice questions. |
| Dataset Splits | Yes | The standard dataset has very few training points, so we randomly draw 30% and 10% of the points from the test split and include them in the training set and validation set respectively. Note that we remove these points from the test set. The resulting splits have 4.5k, 2.9k, and 8.4k points in the train, validation, and test splits. ... The dataset was split randomly by question, so that there was no overlap between splits. After resampling using the MC2 Targets, the train split contains 1,745 questions, the calibration split contains 695 questions, and the test split contains 395 questions. ... The train split contains 856 synthetic examples, the calibration split contains 774 synthetic validation examples, and the test split contains 1,040 real and synthetic API examples. Splits are created to ensure no overlap in APIs occurs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs like Llama-3-8B-Instruct, Phi-3-mini-4k-instruct, and gemma-2-9b-it-SimPO, but it does not specify versions of any ancillary software components like programming languages, frameworks (e.g., PyTorch, TensorFlow), or libraries with their version numbers. |
| Experiment Setup | Yes | The hyperparameter settings we used for CP-OPT are given in Appendix F. ... The hyperparameters used to learn the score function using SGD are provided in Table 21 in Appendix F. |
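The CROQ step the paper describes in prose (prune answer options to a conformal prediction set, then re-prompt the LLM with only those options) can be sketched with standard split conformal prediction. This is an illustrative sketch under assumptions: it uses a plain softmax-score nonconformity measure rather than the paper's learned CP-OPT score function, and the function names here are invented for the example.

```python
import numpy as np

def conformal_threshold(cal_scores, cal_labels, alpha=0.1):
    """Split-conformal threshold q_hat from a held-out calibration set.

    cal_scores: (n, k) array of per-option scores (e.g., softmax probabilities).
    cal_labels: (n,) array of true option indices.
    alpha: target miscoverage rate, so prediction sets cover the true
           answer with probability at least 1 - alpha (marginally).
    """
    n = len(cal_labels)
    # Nonconformity of each calibration point: 1 minus the score the
    # model assigned to the true option.
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level for split conformal.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(nonconf, level, method="higher")

def prediction_set(scores, q_hat):
    """Keep every option whose nonconformity (1 - score) is within q_hat.

    In CROQ, only these surviving options would be shown to the LLM in a
    second, reduced-choice prompt.
    """
    return [i for i, s in enumerate(scores) if 1.0 - s <= q_hat]
```

On a new question, `prediction_set` typically returns fewer than the original k options, so the re-prompted LLM chooses among a pruned answer list while the set still covers the correct answer at the target rate.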