Deep Generalized Prediction Set Classifier and Its Theoretical Guarantees

Authors: Zhou Wang, Xingye Qiao

TMLR 2024

Reproducibility assessment — each entry gives the variable, the extracted result, and the supporting LLM response:
Research Type: Experimental. "Empirically, our method outperforms the baselines on several benchmark datasets. ... 5 Experiments. Baselines. Deep GPS is compared with baselines (GPS, BCOPS, and CDL) tailored to the task of set-valued classification with OOD detection. ... Datasets. We deploy all methods on CIFAR-10, MNIST, and Fashion-MNIST datasets. ... Metrics. We present the sample class-specific accuracy, the aligned OOD recall, and the aligned efficiency. ... Results. The results of the two SCOD methods are reported when the rate of incorrectly rejecting normal observations is around γ, by thresholding their score functions."
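The evaluation protocol quoted above — thresholding a score function so that roughly a γ fraction of normal observations is incorrectly rejected — can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the higher-score-means-more-normal convention are assumptions.

```python
def threshold_at_rate(normal_scores, gamma):
    """Pick a threshold on a (higher-is-more-normal) score so that
    roughly a gamma fraction of normal observations falls below it
    and would be incorrectly rejected."""
    s = sorted(normal_scores)
    k = int(len(s) * gamma)  # index of the gamma-quantile
    return s[k]

scores = list(range(1, 101))            # toy normal-class scores
t = threshold_at_rate(scores, 0.05)
rejected = [x for x in scores if x < t]
# about 5% of the normal observations fall below the threshold
```

With the threshold fixed this way on normal data, OOD recall and prediction-set efficiency can then be compared at a matched rejection rate across methods.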
Researcher Affiliation: Academia. Zhou Wang (EMAIL), Department of Mathematics and Statistics, Binghamton University, the State University of New York; Xingye Qiao (EMAIL), Department of Mathematics and Statistics, Binghamton University, the State University of New York.
Pseudocode: No. The paper describes the methodology in prose and includes figures illustrating concepts (e.g., Figure 1, Figure 2, Figure 3), but it does not contain a clearly labeled pseudocode block or algorithm.
Open Source Code: No. The paper neither states that the source code is released nor links to a code repository for the described methodology.
Open Datasets: Yes. "Datasets. We deploy all methods on CIFAR-10, MNIST, and Fashion-MNIST datasets."
Dataset Splits: Yes. "We split each original dataset into three sets: a labeled set containing only normal classes to mimic distribution P, an unlabeled set mixing normal and OOD classes to mimic distribution Q, and the test set. The first two are used to train the model, and the test set is used to evaluate performance. ... The training set (the labeled and unlabeled sets mentioned in Section 5) is further split with a ratio of 9:1 into training data for learning f and validation data for tuning parameters."
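The 9:1 train/validation split described above can be sketched with the standard library; the helper name, seeding, and index-based interface are illustrative assumptions, not the paper's implementation.

```python
import random

def split_9_to_1(indices, seed=0):
    """Illustrative 9:1 split of the training pool (labeled plus
    unlabeled sets) into data for learning f and validation data
    for tuning; the helper name and fixed seed are assumptions."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = len(shuffled) // 10  # 1/10 of the pool held out
    return shuffled[n_val:], shuffled[:n_val]

train_idx, val_idx = split_9_to_1(range(50_000))
# 45,000 indices for learning f, 5,000 for validation
```

In the paper's setting both the labeled part (for checking the misclassification rate) and the unlabeled part (for measuring prediction-set size) of the validation data come out of this split.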
Hardware Specification: No. The paper does not describe the specific hardware (e.g., GPU models, CPU types, or cloud instance specifications) used to run the experiments.
Software Dependencies: No. "We use the Adam optimizer (Kingma & Ba, 2014) with learning rate lr = 10⁻⁴ and (β1, β2) = (0.999, 0.999). ... The ResNet18 (He et al., 2016) architecture is the same as the one in PyTorch, and LeNet-type CNNs (Ruff et al., 2020)." The paper mentions PyTorch and the Adam optimizer but does not specify their version numbers.
Experiment Setup: Yes. "Experiments run over 150 epochs with batch size 512. ... We use the Adam optimizer (Kingma & Ba, 2014) with learning rate lr = 10⁻⁴ and (β1, β2) = (0.999, 0.999). Additionally, we set the weight decay in Adam as 10⁻⁴. ... The values of γ are prescribed as 0.05, 0.01, and 0.01 for CIFAR-10, MNIST, and Fashion-MNIST, respectively. ... The tuning parameter C is determined such that the prediction set is smallest on the unlabeled part of the validation data when the misclassification rate is close to γ on the labeled part of the validation data."
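The quoted tuning rule for C can be sketched as a simple grid search. Everything here is a hypothetical sketch: the `eval_fn` interface, the tolerance `tol`, and the candidate grid are assumptions, not details from the paper.

```python
def select_C(candidates, eval_fn, gamma, tol=0.005):
    """Among candidate C values whose validation misclassification
    rate (labeled part) is within tol of gamma, return the one with
    the smallest average prediction-set size (unlabeled part)."""
    feasible = []
    for C in candidates:
        err, set_size = eval_fn(C)  # assumed (rate, size) interface
        if abs(err - gamma) <= tol:
            feasible.append((set_size, C))
    if not feasible:
        raise ValueError("no candidate meets the misclassification target")
    return min(feasible)[1]  # smallest set size wins

# toy eval_fn: maps C to (misclassification rate, mean set size)
results = {0.1: (0.050, 3.2), 0.5: (0.052, 2.1), 1.0: (0.200, 1.0)}
best_C = select_C(results, lambda C: results[C], gamma=0.05)
# both C = 0.1 and C = 0.5 meet the rate target; 0.5 gives the smaller set
```

The design mirrors the paper's stated criterion: feasibility is defined by the misclassification rate on labeled validation data, and efficiency (set size) on unlabeled validation data breaks ties among feasible candidates.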