Partial-Label Learning with a Reject Option
Authors: Tobias Fuchs, Florian Kalinke, Klemens Böhm
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on artificial and real-world datasets show that our method provides the best trade-off between the number and accuracy of non-rejected predictions when compared to our competitors, which use confidence thresholds for rejecting unsure predictions. When evaluated without the reject option, our nearest-neighbor-based approach also achieves competitive prediction performance. |
| Researcher Affiliation | Academia | Tobias Fuchs EMAIL Karlsruhe Institute of Technology, Germany |
| Pseudocode | Yes | Algorithm 1 DST-PLL (Our proposed method) |
| Open Source Code | Yes | Experiments. Extensive experiments on artificial and real-world data support our claims. We make our code and data openly available: https://github.com/mathefuchs/pll-with-a-reject-option |
| Open Datasets | Yes | For the supervised datasets, we use the ecoli (Horton & Nakai, 1996), multiple-features (Duin, 2002), pen-digits (Alpaydin & Alimoglu, 1998), semeion (Buscema & Terzi, 2008), solar-flare (Dodson & Hedeman, 1989), statlog-landsat (Srinivasan, 1993), and theorem datasets (Bridge et al., 2013) from the UCI repository (Bache & Lichman, 2013). These datasets contain between 336 and 10 992 instances each. Also, we use the popular MNIST (LeCun et al., 1999), KMNIST (Clanuwat et al., 2018), and FMNIST datasets (Xiao et al., 2018), which contain 60 000 images each similar to other datasets like CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). For the partially labeled data, we use the bird-song (Briggs et al., 2012), flickr (Huiskes & Lew, 2008), yahoo-news (Guillaumin et al., 2010), and msrc-v2 datasets (Liu & Dietterich, 2012). |
| Dataset Splits | No | The paper states: "We repeat all experiments five times to report averages and standard deviations." and references a "default protocol" for data handling. However, it does not give explicit training/test/validation percentages, sample counts, or a partitioning procedure detailed enough for exact reproduction. |
| Hardware Specification | Yes | All experiments need two to three days on a machine with 48 cores and one NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper states: "We have implemented all approaches in Python using the PyTorch library." However, it does not provide version numbers for Python, PyTorch, or any other library. |
| Experiment Setup | Yes | As mentioned in Section 5.1, we consider ten commonly used PLL approaches. We choose their parameters as recommended by the respective authors. Pl-Knn (Hüllermeier & Beringer, 2005): For all non-MNIST datasets, we use k = 10 neighbors as recommended by the authors. For the MNIST datasets, we use the hidden representation of a variational auto-encoder as instance features and use k = 20. The variational auto-encoder has a 768-dimensional input layer (flat MNIST input), a 512-dimensional second layer, and 48-dimensional bottleneck layers for the mean and variance representations. The decoder uses a 48-dimensional first layer, a 512-dimensional second layer, and a 768-dimensional output layer with sigmoid activation. Otherwise, we use ReLU activations between all layers. Binary cross-entropy is used as a reconstruction loss. We choose the AdamW optimizer for training. Pl-Svm (Nguyen & Caruana, 2008): We use the Pegasos optimizer (Shalev-Shwartz et al., 2007) and λ = 1. Ipal (Zhang & Yu, 2015): We use k = 10 neighbors, α = 0.95, and 100 iterations. Pl-Ecoc (Zhang et al., 2017): We use L = 10 log2(l) and τ = 0.1 as recommended. Proden (Lv et al., 2020): For a fair comparison, we use the same base models for all neural-network-based approaches. We use a standard d-300-300-300-l MLP (Werbos, 1974) for the non-MNIST datasets with ReLU activations, batch normalizations, and softmax output. For the MNIST datasets, we use the LeNet architecture (LeCun et al., 1998). We choose the Adam optimizer for training. Cc (Feng et al., 2020): We use the same base models as mentioned above for Proden. Valen (Xu et al., 2021): We use the same base models as mentioned above for Proden. Pop (Xu et al., 2023): We use the same base models as mentioned above for Proden. Also, we set e0 = 0.001, eend = 0.04, and es = 0.001. We abstain from using the data augmentations discussed in the paper for a fair comparison. CroSel (Tian et al., 2024): We use the same base models as mentioned above for Proden. We use 10 warm-up epochs using Cc and λcr = 2. We abstain from using the data augmentations discussed in the paper for a fair comparison. Dst-Pll (our proposed approach): Similar to Pl-Knn and Ipal, we use k = 10 neighbors for the non-MNIST datasets. For the MNIST datasets, we use the hidden representation of a variational auto-encoder as instance features and use k = 20. The architecture of the variational auto-encoder is the same as described above for Pl-Knn. |
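The "trade-off between the number and accuracy of non-rejected predictions" mentioned in the Research Type row can be made concrete with a small coverage/selective-accuracy computation. The sketch below is ours, not the paper's code: it mirrors the confidence-threshold rejection scheme attributed to the competitors, and the function name `coverage_accuracy` and the toy arrays are illustrative assumptions.

```python
def coverage_accuracy(preds, confidences, labels, threshold):
    """Confidence-threshold rejection (illustrative, not the paper's code).

    A prediction is kept only if its confidence reaches the threshold.
    Returns (coverage, selective_accuracy):
      coverage           -- fraction of instances not rejected
      selective_accuracy -- accuracy among the non-rejected predictions
    """
    kept = [(p, y) for p, c, y in zip(preds, confidences, labels)
            if c >= threshold]
    if not kept:
        return 0.0, float("nan")  # everything was rejected
    coverage = len(kept) / len(preds)
    accuracy = sum(p == y for p, y in kept) / len(kept)
    return coverage, accuracy

# Toy example: five predictions, two of them low-confidence.
preds = [0, 1, 1, 2, 0]
conf = [0.9, 0.4, 0.8, 0.3, 0.7]
labels = [0, 0, 1, 2, 1]
cov, acc = coverage_accuracy(preds, conf, labels, threshold=0.5)
# cov = 0.6 (3 of 5 kept); acc = 2/3 on the kept predictions
```

Raising the threshold trades coverage for selective accuracy; the paper's claim is that DST-PLL sits on a better point of this curve than threshold-based competitors.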
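The variational auto-encoder used to embed the MNIST-style datasets is specified in the setup row only by its layer widths (768 → 512 → 48-dimensional mean/log-variance bottleneck, mirrored decoder with sigmoid output, ReLU in between). A minimal NumPy forward pass, assuming those stated widths and untrained random weights, makes the shapes concrete; this is a sketch of the architecture only, not the authors' PyTorch implementation, which additionally trains with binary cross-entropy reconstruction loss and AdamW.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense(n_in, n_out):
    # Untrained random weights; the paper trains with BCE loss + AdamW.
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

# Encoder: 768 (flat input width as stated) -> 512 -> two 48-dim heads.
W1, b1 = dense(768, 512)
W_mu, b_mu = dense(512, 48)
W_lv, b_lv = dense(512, 48)
# Decoder: 48 -> 512 -> 768, sigmoid output.
W2, b2 = dense(48, 512)
W3, b3 = dense(512, 768)

def encode(x):
    h = relu(x @ W1 + b1)
    return h @ W_mu + b_mu, h @ W_lv + b_lv  # mean, log-variance

def decode(z):
    return sigmoid(relu(z @ W2 + b2) @ W3 + b3)

x = rng.random((4, 768))  # a batch of 4 flat inputs
mu, logvar = encode(x)
# Reparameterization trick: sample z from N(mu, exp(logvar)).
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
x_rec = decode(z)
# mu.shape == (4, 48); x_rec.shape == (4, 768)
```

The 48-dimensional `mu` is what Pl-Knn and Dst-Pll would use as the instance features for their k = 20 nearest-neighbor search on the MNIST datasets.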