Confidence Difference Reflects Various Supervised Signals in Confidence-Difference Classification
Authors: Yuanchao Dai, Ximing Li, Changchun Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark and UCI datasets demonstrate the effectiveness of our method. Additionally, to effectively capture the influence of real-world noise on the confidence difference, we artificially perturb the confidence difference distribution and demonstrate the robustness of our method under noisy conditions through comprehensive experiments. |
| Researcher Affiliation | Academia | (1) College of Computer Science and Technology, Jilin University, China; (2) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China. Correspondence to: Ximing Li <EMAIL>. |
| Pseudocode | No | The paper contains mathematical derivations and proofs in the main body and appendices, but does not feature any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper mentions utilizing publicly available code for baseline methods (e.g., 'We utilize the publicly available code online.'). However, there is no explicit statement or link provided by the authors for the open-source code of their proposed CRCR method. |
| Open Datasets | Yes | To thoroughly evaluate our method, we employ four popular benchmark datasets, including MNIST (LeCun et al., 1998), Kuzushiji-MNIST (K-MNIST) (Clanuwat et al., 2018), Fashion-MNIST (F-MNIST) (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009). Additionally, experiments are conducted on two UCI datasets (http://archive.ics.uci.edu/), including Optdigits and Pendigits. |
| Dataset Splits | Yes | Table 1 (detailed characteristics of datasets) reports, per dataset: #Instance / #Trainset / #Testset / #Features / positive classes / negative classes / backbone. MNIST: 70,000 / 15,000 / 5,000 / 28x28 / {0,2,4,6,8} / {1,3,5,7,9} / 3-layer MLP; F-MNIST: 70,000 / 15,000 / 5,000 / 28x28 / {0,2,4,6,8} / {1,3,5,7,9} / 3-layer MLP; K-MNIST: 70,000 / 15,000 / 5,000 / 28x28 / {0,2,4,6,8} / {1,3,5,7,9} / 3-layer MLP; CIFAR-10: 60,000 / 10,000 / 5,000 / 3x32x32 / {2,3,4,5,6,7} / {0,1,8,9} / ResNet-34; Optdigits: 5,620 / 1,200 / 1,125 / 62 / {0,2,4,6,8} / {1,3,5,7,9} / Linear; Pendigits: 10,992 / 2,500 / 2,199 / 16 / {0,2,4,6,8} / {1,3,5,7,9} / Linear. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'logistic loss function' and 'Adam optimizer', and refers to 'ReLU activation function and batch normalization'. However, it does not specify any software packages with version numbers (e.g., Python version, specific deep learning framework versions like PyTorch or TensorFlow, or other library versions). |
| Experiment Setup | Yes | Implementation details: For each comparison method under every experimental configuration, we execute the code five times, employing the logistic loss function and Adam optimizer consistently. Specifically, during the training phase, each run is independently performed for 200 epochs with a batch size of 256. In balanced scenarios (i.e., π = 0.5), the learning rate is set to 10^-3 across all datasets, with weight decay parameters set to 10^-5 for MNIST, K-MNIST, F-MNIST, and CIFAR-10, 10^-4 for Optdigits, and 10^-3 for Pendigits. In imbalanced scenarios (i.e., π = 0.2), the learning rate is set to 10^-4 for MNIST and K-MNIST, and 10^-3 for the remaining datasets, with weight decay parameters set to 10^-4 for K-MNIST and Optdigits, and 10^-5 for the remaining datasets. During the pretraining phase, each run is independently executed for 20 epochs with a batch size of 256. The learning rate and weight decay remain consistent with those in the training phase. Moreover, we choose different models as backbones based on the varying feature dimensions of each dataset. Specifically, for MNIST, K-MNIST and F-MNIST, we use a 3-layer multilayer perceptron (MLP) with three hidden layers of width 300 equipped with the ReLU (Nair & Hinton, 2010) activation function and batch normalization (Ioffe & Szegedy, 2015). For CIFAR-10, we train a ResNet-34 model (He et al., 2016) as the backbone. For all UCI datasets, we use a linear model for training. |
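Since the authors' code is not released, the per-dataset hyperparameter schedule quoted above can be collected into a small lookup for anyone attempting a reproduction. This is a hypothetical sketch: `get_config` and its field names are illustrative conventions, not identifiers from the paper, and only the values stated in the implementation details are encoded.

```python
def get_config(dataset: str, prior: float) -> dict:
    """Return the training settings reported for a dataset under class prior pi.

    Values transcribed from the paper's implementation details; the function
    name and dict keys are illustrative, not from the authors' code.
    """
    if prior == 0.5:  # balanced scenario
        lr = 1e-3  # same learning rate across all datasets
        wd = {"Optdigits": 1e-4, "Pendigits": 1e-3}.get(dataset, 1e-5)
    elif prior == 0.2:  # imbalanced scenario
        lr = 1e-4 if dataset in ("MNIST", "K-MNIST") else 1e-3
        wd = 1e-4 if dataset in ("K-MNIST", "Optdigits") else 1e-5
    else:
        raise ValueError("the paper only reports pi in {0.5, 0.2}")
    return {
        "optimizer": "Adam",
        "loss": "logistic",
        "epochs": 200,        # the pretraining phase uses 20 epochs instead
        "batch_size": 256,
        "lr": lr,
        "weight_decay": wd,
    }
```

For example, `get_config("Pendigits", 0.5)` yields a weight decay of 10^-3, while `get_config("MNIST", 0.2)` drops the learning rate to 10^-4, matching the imbalanced-scenario description.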