Conformity Score Averaging for Classification

Authors: Rui Luo, Zhixin Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage. Our code is available at https://github.com/luo-lorry/Weighting.
Researcher Affiliation | Collaboration | (1) Department of System Engineering, City University of Hong Kong, China; (2) Alpha Benito Research, Los Angeles, USA. Correspondence to: Rui Luo <EMAIL>, Zhixin Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1 Split Conformal Prediction. Algorithm 2 Conformal Score Averaging.
Open Source Code | Yes | Our code is available at https://github.com/luo-lorry/Weighting.
Open Datasets | Yes | Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage. Our code is available at https://github.com/luo-lorry/Weighting. In the experiments on CIFAR-10 and CIFAR-100, testing images, which were not used during the pretraining of the model, were used as the I_train and I_test sets. We conducted additional experiments on MNIST, Fashion-MNIST, and ImageNet-Val.
Dataset Splits | Yes | Section 2.5. Data Splitting: In Algorithm 2, the method for splitting the data into I_1, I_2, and I_3 has not been specified. To explore potential data splitting approaches, we first highlight the following two key observations... Based on the observations above, we introduce the following four possible ways of data splitting... 100 runs with different index splits were conducted to ensure robustness. We have additional experiments on data splitting ratio, which can be found in Section G in the supplementary file. (b) Comparison of coverage and size for different data split methods at a significance level of α = 0.05 when I_train : I_test = 99:1.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other detailed computer specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers needed to replicate the experiment.
Experiment Setup | Yes | The experiments were performed for different significance levels α ranging from 0.01 to 0.05. 100 runs with different index splits were conducted to ensure robustness. To solve (4), we discretize the probability simplex using a grid resolution ε = 0.01.
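
The setup above mentions split conformal prediction, weighted averaging of several scores, and a grid over the probability simplex with resolution ε = 0.01. The following is a minimal, illustrative sketch of how those pieces could fit together. It is not the authors' implementation (see their repository for that). All function names (simplex_grid, conformal_threshold, averaged_score, prediction_set, select_weights) are hypothetical. It also rests on assumptions: scores are treated as nonconformity scores (lower means the label fits better), and weights are chosen by minimizing the average prediction set size on a selection split.

```python
import math


def simplex_grid(num_scores, eps=0.01):
    """Yield weight tuples on the probability simplex with grid step eps."""
    steps = round(1 / eps)

    def compositions(remaining, parts):
        # All ways to write `remaining` as an ordered sum of `parts` non-negative ints.
        if parts == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in compositions(remaining - first, parts - 1):
                yield (first,) + rest

    for combo in compositions(steps, num_scores):
        yield tuple(c / steps for c in combo)


def conformal_threshold(cal_scores, alpha):
    """Finite-sample split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every label is included
    return sorted(cal_scores)[rank - 1]


def averaged_score(score_vector, weights):
    """Weighted average of the individual scores for one (input, label) pair."""
    return sum(w * s for w, s in zip(weights, score_vector))


def prediction_set(scores_per_label, weights, threshold):
    """Labels whose averaged score does not exceed the calibrated threshold.

    scores_per_label[y] holds the vector of individual scores for label y.
    """
    return [
        y for y, vec in enumerate(scores_per_label)
        if averaged_score(vec, weights) <= threshold
    ]


def select_weights(selection_data, alpha, eps=0.01):
    """Grid search for the weights giving the smallest average set size.

    selection_data is a list of (scores_per_label, true_label) pairs.
    Assumption: threshold and set size are both evaluated on this split.
    """
    num_scores = len(selection_data[0][0][0])
    best_weights, best_size = None, math.inf
    for weights in simplex_grid(num_scores, eps):
        true_scores = [averaged_score(s[y], weights) for s, y in selection_data]
        threshold = conformal_threshold(true_scores, alpha)
        avg_size = sum(
            len(prediction_set(s, weights, threshold)) for s, _ in selection_data
        ) / len(selection_data)
        if avg_size < best_size:
            best_weights, best_size = weights, avg_size
    return best_weights, best_size
```

In a full pipeline one would typically recompute the threshold with conformal_threshold on a separate calibration split using the selected weights, then call prediction_set on test inputs. The grid grows quickly with the number of scores, so a coarser eps may be needed when more than a few scores are combined.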