Conformal Structured Prediction
Authors: Botong Zhang, Shuo Li, Osbert Bastani
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020). Our experiments demonstrate how our approach can be used to construct small prediction sets while satisfying a desired coverage guarantee (marginal or PAC). |
| Researcher Affiliation | Academia | Botong Zhang, Shuo Li & Osbert Bastani Computer and Information Science University of Pennsylvania EMAIL |
| Pseudocode | No | The paper describes the algorithms in Section 3 ('Algorithms for Structured Conformal Prediction') and Section 4 ('Application to DAG Structured Prediction Sets'), including an integer programming formulation, but does not present explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation is available at https://github.com/botong516/Conformal-Structured-Prediction. |
| Open Datasets | Yes | We empirically validate our approach by demonstrating that it constructs prediction sets that satisfy the desired coverage guarantees while producing reasonably sized prediction sets. We evaluate our approach on five tasks: (i) predicting numbers represented as lists of MNIST digits (LeCun & Cortes, 2010), (ii) ImageNet classification (Deng et al., 2009) with hierarchical label space tasks, (iii) SQuAD question answering (Rajpurkar et al., 2016) where the answer is a year, (iv) Python code generation based on the MBPP dataset (Austin et al., 2021), similar to the task studied in Khakhar et al. (2023), and (v) predicting emotions on the GoEmotions dataset (Demszky et al., 2020). |
| Dataset Splits | No | The paper mentions using a held-out calibration set Z and a held-out test set for evaluation, for example in the section '2 PROBLEM FORMULATION' and '5.2 RESULTS'. However, it does not explicitly provide specific details about the splitting methodology (e.g., percentages, absolute counts for train/validation/test splits, or how the held-out test set was created from the original datasets). |
| Hardware Specification | No | The paper does not specify any particular hardware, such as GPU models, CPU types, or other computing resources used for the experiments. |
| Software Dependencies | No | The paper refers to specific models used like 'Llama-3.1-70B-Instruct model (Dubey et al., 2024)', 'gpt-4o-mini model', and 'RoBERTa base model (Sam Lowe, 2024)', but does not list specific software libraries or programming language versions required to replicate the experiments (e.g., Python version, PyTorch version, etc.). |
| Experiment Setup | Yes | Hyperparameters. We use m ∈ {1, 2, 4, 8} (default of m = 4), ϵ ∈ {0.05, 0.1, 0.15, 0.2} (default of ϵ = 0.1), and δ ∈ {0.1, 0.01, 0.001} (default of δ = 0.01). Also, for the question answering task, it details using a 'two-shot prompting technique' and for Python code generation, it states 'we provided the gpt-4o-mini model with a natural language prompt along with k lines of code from the original ground truth program in the dataset, instructing the model to complete the program to solve the prompt.' |