reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Conformity Score Averaging for Classification

Authors: Rui Luo, Zhixin Zhou

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage. Our code is available at https://github.com/luo-lorry/Weighting.
Researcher Affiliation	Collaboration	¹Department of System Engineering, City University of Hong Kong, China; ²Alpha Benito Research, Los Angeles, USA. Correspondence to: Rui Luo <EMAIL>, Zhixin Zhou <EMAIL>.
Pseudocode	Yes	Algorithm 1 Split Conformal Prediction. Algorithm 2 Conformal Score Averaging.
Open Source Code	Yes	Our code is available at https://github.com/luo-lorry/Weighting.
Open Datasets	Yes	Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage. Our code is available at https://github.com/luo-lorry/Weighting. In the experiments on CIFAR-10 and CIFAR-100, testing images, which were not used during the pretraining of the model, were used as the I_train and I_test sets. We conducted additional experiments on MNIST, Fashion-MNIST, and ImageNet-Val.
Dataset Splits	Yes	Section 2.5. Data Splitting: In Algorithm 2, the method for splitting the data into I1, I2, and I3 has not been specified. To explore potential data splitting approaches, we first highlight the following two key observations... Based on the observations above, we introduce the following four possible ways of data splitting... 100 runs with different index splits were conducted to ensure robustness. We have additional experiments on data splitting ratio, which can be found in Section G in the supplementary file. (b) Comparison of coverage and size for different data split methods at a significance level of α = 0.05 when I_train : I_test = 99:1.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU/CPU models or other detailed computer specifications used for running its experiments.
Software Dependencies	No	The paper does not provide specific ancillary software details with version numbers needed to replicate the experiment.
Experiment Setup	Yes	The experiments were performed for different significance levels α ranging from 0.01 to 0.05. 100 runs with different index splits were conducted to ensure robustness. To solve (4), we discretize the probability simplex using a grid resolution ε = 0.01.
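
The Pseudocode and Experiment Setup rows mention split conformal prediction, score averaging, and a grid over the probability simplex with resolution ε = 0.01. The sketch below illustrates how these pieces could fit together. It is not the authors' implementation (see their repository for that). It assumes nonconformity scores where smaller means more plausible, prediction sets of the form {y : score ≤ threshold}, and that the objective being minimized is the average prediction set size. All function names (`simplex_grid`, `conformal_quantile`, `combine`, `mean_set_size`, `select_weights`) are hypothetical.

```python
import itertools
import math


def simplex_grid(num_scores, eps=0.01):
    """Yield weight vectors on the probability simplex with grid step eps."""
    steps = round(1 / eps)
    for head in itertools.product(range(steps + 1), repeat=num_scores - 1):
        rest = steps - sum(head)
        if rest >= 0:  # keep only points whose entries sum to exactly 1
            yield [k / steps for k in (*head, rest)]


def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the threshold is infinite (full label set).
    return math.inf if k > n else sorted(cal_scores)[k - 1]


def combine(weights, scores):
    """scores[k][i][c] is score function k on example i, class c.

    Returns the weighted average combined[i][c].
    """
    n, num_classes = len(scores[0]), len(scores[0][0])
    return [
        [sum(w * s[i][c] for w, s in zip(weights, scores)) for c in range(num_classes)]
        for i in range(n)
    ]


def mean_set_size(combined, labels, alpha):
    """Average prediction set size when thresholding at the conformal quantile."""
    true_scores = [row[y] for row, y in zip(combined, labels)]
    q = conformal_quantile(true_scores, alpha)
    return sum(sum(v <= q for v in row) for row in combined) / len(combined)


def select_weights(scores, labels, alpha, eps=0.01):
    """Grid search for the weights giving the smallest average set size (assumed objective)."""
    return min(
        simplex_grid(len(scores), eps),
        key=lambda w: mean_set_size(combine(w, scores), labels, alpha),
    )
```

In this sketch, weights chosen by `select_weights` on one subset of the data would then be used to recompute the threshold with `conformal_quantile` on a separate calibration subset, so that weight selection does not affect the coverage guarantee. Which of the splitting schemes from Section 2.5 of the paper this corresponds to is not specified here.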