Conformal Structured Prediction

Authors: Botong Zhang, Shuo Li, Osbert Bastani

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020). Our experiments demonstrate how our approach can be used to construct small prediction sets while satisfying a desired coverage guarantee (marginal or PAC).
Researcher Affiliation | Academia | Botong Zhang, Shuo Li & Osbert Bastani, Computer and Information Science, University of Pennsylvania
Pseudocode | No | The paper describes its algorithms in Section 3 ('Algorithms for Structured Conformal Prediction') and Section 4 ('Application to DAG Structured Prediction Sets'), including an integer programming formulation, but does not present explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The implementation is available at https://github.com/botong516/Conformal-Structured-Prediction.
Open Datasets | Yes | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020).
Dataset Splits | No | The paper mentions using a held-out calibration set Z and a held-out test set for evaluation, for example in Section 2 ('Problem Formulation') and Section 5.2 ('Results'). However, it does not explicitly describe the splitting methodology (e.g., percentages, absolute counts for train/validation/test splits, or how the held-out test set was created from the original datasets).
Hardware Specification | No | The paper does not specify any particular hardware, such as GPU models, CPU types, or other computing resources used for the experiments.
Software Dependencies | No | The paper refers to specific models used, such as the 'Llama-3.1-70B-Instruct model (Dubey et al., 2024)', the 'gpt-4o-mini model', and the 'RoBERTa base model (Sam Lowe, 2024)', but does not list the software libraries or language versions required to replicate the experiments (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | Hyperparameters. We use m ∈ {1, 2, 4, 8} (default of m = 4), ϵ ∈ {0.05, 0.1, 0.15, 0.2} (default of ϵ = 0.1), and δ ∈ {0.1, 0.01, 0.001} (default of δ = 0.01). For the question answering task, the paper details using a 'two-shot prompting technique', and for Python code generation it states: 'we provided the gpt-4o-mini model with a natural language prompt along with k lines of code from the original ground truth program in the dataset, instructing the model to complete the program to solve the prompt.'
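To make the role of ϵ concrete, the following is a minimal sketch of standard split-conformal calibration at miscoverage level ϵ. It is not the paper's structured-prediction algorithm (which uses an integer program over a DAG label space); the function name `conformal_threshold` and the synthetic scores are illustrative assumptions.

```python
import numpy as np

def conformal_threshold(cal_scores, epsilon):
    """Return the finite-sample-adjusted (1 - epsilon) empirical
    quantile of held-out calibration nonconformity scores.

    Including labels whose score is <= this threshold yields the
    standard marginal coverage guarantee of at least 1 - epsilon.
    """
    n = len(cal_scores)
    # Finite-sample correction: ceil((n + 1) * (1 - epsilon)) / n,
    # clipped to 1.0 when n is too small for the requested level.
    q = min(1.0, np.ceil((n + 1) * (1 - epsilon)) / n)
    return np.quantile(cal_scores, q, method="higher")

# Example with synthetic scores and the paper's default epsilon = 0.1.
rng = np.random.default_rng(0)
scores = rng.random(500)  # stand-in for real nonconformity scores
tau = conformal_threshold(scores, epsilon=0.1)
```

On the calibration data itself, at least a 1 - ϵ fraction of scores fall at or below `tau` by construction; the guarantee transfers to exchangeable test points.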