Direct Prediction Set Minimization via Bilevel Conformal Classifier Training

Authors: Yuanjie Shi, Hooman Shahrokhi, Xuesong Jia, Xiongzhi Chen, Jana Doppa, Yan Yan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline by 20.46% in prediction set size and validates our theory.
Researcher Affiliation | Academia | (1) School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington, USA; (2) Department of Mathematics and Statistics, Washington State University, Pullman, Washington, USA.
Pseudocode | Yes | Algorithm 1: Direct Prediction Set Minimization (DPSM)
Open Source Code | Yes | The DPSM code is available at https://github.com/YuanjieSh/DPSM_code.
Open Datasets | Yes | We utilize the benchmark datasets CIFAR-100 (Krizhevsky et al., 2009), Caltech-101 (Fei-Fei et al., 2004), and iNaturalist (Van Horn et al., 2018), where all details are summarized in Table 2 of Appendix E.
Dataset Splits | Yes | Table 2 describes the datasets. The number of classes in the iNaturalist dataset depends on the taxonomy level (e.g., species, genus, family); we employ the Fungi species level, which has 341 different categories.

Data | Classes | Training | Validation | Calibration | Test
CIFAR-100 | 100 | 45000 | 5000 | 3000 | 7000
Caltech-101 | 101 | 4310 | 1256 | 1111 | 2000
iNaturalist | 341* | 15345 | 1705 | 1410 | 2000
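The four-way split above (training/validation/calibration/test) can be reproduced as a disjoint index partition. The sketch below is our own illustration of such a partition using the CIFAR-100 counts from Table 2; the fixed seed and shuffling scheme are assumptions, not details taken from the paper.

```python
import random

def split_indices(n, sizes, seed=0):
    """Partition indices 0..n-1 into disjoint named splits.

    `sizes` maps split name -> count; counts must sum to n.
    Shuffling with a fixed seed keeps the partition reproducible.
    """
    assert sum(sizes.values()) == n, "split sizes must cover the dataset"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    splits, start = {}, 0
    for name, count in sizes.items():
        splits[name] = idx[start:start + count]
        start += count
    return splits

# CIFAR-100 row of Table 2: 45000/5000/3000/7000 over 60000 examples.
cifar_splits = split_indices(
    60000,
    {"train": 45000, "validation": 5000, "calibration": 3000, "test": 7000},
)
```

Conformal prediction requires the calibration split to be disjoint from the data used for training, which is why it is carved out separately here rather than reused from the validation split.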
Hardware Specification | No | The paper mentions deep models and neural network architectures but does not specify the hardware used to run experiments; there are no mentions of specific GPU models, CPU models, TPUs, or detailed cloud instance specifications.
Software Dependencies | No | The paper mentions specific architectures (ResNet, DenseNet) and the SGD optimizer, but it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch or TensorFlow), or programming languages, which are necessary for reproducibility.
Experiment Setup | Yes | Table 3 lists the training details; the reported hyperparameters are those that give the best predictive efficiency. The SGD optimizer was used for all training unless specified otherwise.

Data | Architecture | Batch size | Epochs | η | lr schedule | Momentum | Weight decay | γ | λ
CIFAR-100 | DenseNet | 64 | 40 | 0.1 | 25 | 0.9 | 0.1 | 0.01 | 0.05
CIFAR-100 | ResNet | 128 | 40 | 0.1 | 25 | 0.9 | 0.1 | 0.01 | 0.01
Caltech-101 | DenseNet | 128 | 60 | 0.05 | 25, 40 | 0.9 | 0.1 | 0.1 | 1.0
Caltech-101 | ResNet | 128 | 60 | 0.05 | 25, 40 | 0.9 | 0.1 | 0.05 | 0.1
iNaturalist | DenseNet | 128 | 60 | 0.001 | 3 | 0.9 | 0.97 | 0.001 | 1.0
iNaturalist | ResNet | 128 | 60 | 0.001 | 3 | 0.9 | 0.97 | 0.001 | 0.5
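As a reading aid for Table 3, the "lr schedule" column can be interpreted as milestone epochs of a step schedule for the initial learning rate η. The sketch below is our own minimal illustration under that assumption; the per-milestone decay factor of 0.1 is a common convention, not a value stated in the table.

```python
def lr_at_epoch(base_lr, epoch, milestones, decay=0.1):
    """Step learning-rate schedule: multiply base_lr by `decay` once for
    each milestone epoch that has already been reached.

    Assumed reading of Table 3: the "lr schedule" column lists milestone
    epochs; `decay` (hypothetical here) is the per-milestone factor.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed

# Caltech-101 / DenseNet row: eta = 0.05, milestones at epochs 25 and 40.
schedule = [lr_at_epoch(0.05, epoch, [25, 40]) for epoch in range(60)]
```

Under this reading, the Caltech-101 runs train at 0.05 for the first 25 epochs, drop to 0.005 at epoch 25, and drop again at epoch 40; frameworks such as PyTorch provide the equivalent behavior via a multi-step scheduler.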