Conformal Structured Prediction

Authors: Botong Zhang, Shuo Li, Osbert Bastani

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020). Our experiments demonstrate how our approach can be used to construct small prediction sets while satisfying a desired coverage guarantee (marginal or PAC).
Researcher Affiliation | Academia | Botong Zhang, Shuo Li & Osbert Bastani, Computer and Information Science, University of Pennsylvania
Pseudocode | No | The paper describes its algorithms in Section 3 ('Algorithms for Structured Conformal Prediction') and Section 4 ('Application to DAG Structured Prediction Sets'), including an integer programming formulation, but does not present explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The implementation is available at https://github.com/botong516/Conformal-Structured-Prediction.
Open Datasets | Yes | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020).
Dataset Splits | No | The paper mentions using a held-out calibration set Z and a held-out test set for evaluation, for example in Section 2 ('Problem Formulation') and Section 5.2 ('Results'). However, it does not explicitly describe the splitting methodology (e.g., percentages, absolute counts for train/validation/test splits, or how the held-out test set was created from the original datasets).
Hardware Specification | No | The paper does not specify any particular hardware, such as GPU models, CPU types, or other computing resources used for the experiments.
Software Dependencies | No | The paper refers to specific models used, such as the 'Llama-3.1-70B-Instruct model (Dubey et al., 2024)', the 'gpt-4o-mini model', and the 'RoBERTa base model (Sam Lowe, 2024)', but does not list the software libraries or language versions required to replicate the experiments (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | Hyperparameters. We use m ∈ {1, 2, 4, 8} (default of m = 4), ϵ ∈ {0.05, 0.1, 0.15, 0.2} (default of ϵ = 0.1), and δ ∈ {0.1, 0.01, 0.001} (default of δ = 0.01). For the question answering task, the paper details using a 'two-shot prompting technique', and for Python code generation it states: 'we provided the gpt-4o-mini model with a natural language prompt along with k lines of code from the original ground truth program in the dataset, instructing the model to complete the program to solve the prompt.'
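To make the role of ϵ concrete, the following is a minimal sketch of standard split-conformal calibration at miscoverage level ϵ. It is not the paper's structured-prediction algorithm (which uses an integer program over a DAG label space); the function name `conformal_threshold` and the synthetic scores are illustrative assumptions.

```python
import numpy as np

def conformal_threshold(cal_scores, epsilon):
    """Return the finite-sample-adjusted (1 - epsilon) empirical
    quantile of held-out calibration nonconformity scores.

    Including labels whose score is <= this threshold yields the
    standard marginal coverage guarantee of at least 1 - epsilon.
    """
    n = len(cal_scores)
    # Finite-sample correction: ceil((n + 1) * (1 - epsilon)) / n,
    # clipped to 1.0 when n is too small for the requested level.
    q = min(1.0, np.ceil((n + 1) * (1 - epsilon)) / n)
    return np.quantile(cal_scores, q, method="higher")

# Example with synthetic scores and the paper's default epsilon = 0.1.
rng = np.random.default_rng(0)
scores = rng.random(500)  # stand-in for real nonconformity scores
tau = conformal_threshold(scores, epsilon=0.1)
```

On the calibration data itself, at least a 1 - ϵ fraction of scores fall at or below `tau` by construction; the guarantee transfers to exchangeable test points.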