Unbiased Loss Functions for Multilabel Classification with Missing Labels

Authors: Erik Schultheis, Rohit Babbar

TMLR 2025

Reproducibility assessment (variable, result, and supporting evidence from the paper):
Research Type: Experimental. Evidence: "The theoretical considerations are further supplemented by an experimental study showing that the switch to unbiased estimators significantly alters the bias-variance trade-off and may thus require stronger regularization. In order to judge the severity of the variance and overfitting problems in practice, we conducted three experiments. First, in a pure evaluation setting on synthetic data, we calculated the unbiased recall@k for a varying fraction of missing labels, which shows that once this fraction becomes too large, the variance of the estimate explodes and it becomes unusable. The second experiment demonstrates the change in bias-variance trade-off that results from switching to unbiased estimates. Finally, we repeat that experiment using real data from the Yahoo-Music R3 dataset..." (Sections 7, "Evaluation Experiments", and 8, "Training Experiments".)
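The unbiased recall@k quoted above is commonly obtained by reweighting each observed positive label with its inverse observation propensity; when many labels are missing, propensities are small and the 1/p weights grow, which is exactly the variance explosion the excerpt describes. A minimal NumPy sketch of this construction (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def unbiased_recall_at_k(scores, observed, propensity, k):
    """Estimate recall@k from partially observed labels.

    Each true positive is observed with probability given by
    `propensity`; dividing observed positives by their propensity
    makes the estimate unbiased with respect to the full label set.
    """
    top_k = np.argsort(-scores)[:k]                  # indices of the k highest scores
    # inverse-propensity-weighted count of relevant labels in the top k
    hits = np.sum(observed[top_k] / propensity[top_k])
    # unbiased estimate of the total number of relevant labels
    total = np.sum(observed / propensity)
    return hits / total

# toy example: 6 labels, uniform observation probability 0.5
scores = np.array([0.9, 0.8, 0.7, 0.1, 0.05, 0.01])
observed = np.array([1, 0, 1, 0, 0, 0])
propensity = np.full(6, 0.5)
print(unbiased_recall_at_k(scores, observed, propensity, k=3))  # → 1.0
```

With uniform propensities the weights cancel in the ratio; with small or heterogeneous propensities a single observed label can dominate both numerator and denominator, illustrating the unusable-variance regime the authors report.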
Researcher Affiliation: Academia. Evidence: "Erik Schultheis (EMAIL), Aalto University, Espoo, Finland; Rohit Babbar (EMAIL), University of Bath & Aalto University, Bath, UK & Espoo, Finland."
Pseudocode: No. Evidence: The paper describes mathematical derivations and experimental procedures but does not include any explicitly labeled pseudocode or algorithm blocks; the methods are described narratively and through equations.
Open Source Code: Yes. Evidence: "The code for the experiments is provided at https://github.com/xmc-aalto/missing-labels-tmlr."
Open Datasets: Yes. Evidence: "Finally, we repeat that experiment using real data from the Yahoo-Music R3 dataset, which has been sampled in such a way that the proportion of missing labels can be estimated reliably. ... We took the AmazonCat-13K data and consider only the 100 most common labels... The networks used to generate the results in Table 3 were trained using the DiSMEC (Babbar and Schölkopf, 2017) algorithm, with the loss function being either the squared-hinge loss (VN) or a squared-hinge-loss-based convex surrogate of the unbiased estimate of the 0-1 loss as described in Qaraei et al. (2021). The datasets have been taken from the Extreme Classification Repository (Bhatia et al., 2016), and preprocessed by doing a tf-idf transformation." (Mentioning Bibtex, Mediamill, RCV1-2K, AmazonCat-13K, AmazonCat-14K, EURLex-4K, Amazon-670K, WikiLSHTC-325K in tables and text.)
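The unbiased 0-1 loss surrogate cited above follows the general propensity-scoring recipe: for an observed positive, the positive-label loss is up-weighted by 1/p and a correction term removes the contribution that label would otherwise make as a spurious negative. A sketch of this generic construction with a squared-hinge base loss (this illustrates the recipe, not necessarily the exact surrogate of Qaraei et al., 2021):

```python
import numpy as np

def squared_hinge(score, y):
    """Squared hinge loss for a label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score) ** 2

def unbiased_loss(score, observed, propensity):
    """Propensity-corrected per-label loss.

    An observed positive contributes its positive-label loss scaled
    by 1/p, minus a correction for the negative-label loss it would
    wrongly incur when unobserved; an observed negative contributes
    the plain negative-label loss.
    """
    pos = squared_hinge(score, +1.0)
    neg = squared_hinge(score, -1.0)
    if observed:
        return pos / propensity + (1.0 - 1.0 / propensity) * neg
    return neg

# sanity check: averaging over the observation process recovers the
# loss of a fully observed positive label
s, p = 0.3, 0.4
avg = p * unbiased_loss(s, True, p) + (1 - p) * unbiased_loss(s, False, p)
print(np.isclose(avg, squared_hinge(s, +1.0)))  # → True
```

Note that for small p the corrected loss can become strongly negative, which is one source of the altered bias-variance trade-off discussed in the Research Type row.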
Dataset Splits: Yes. Evidence: "We took 30% of the original training data and used them as validation data to determine the optimal value for the strength of L2-regularization. ... the test set is based on explicitly querying the user on 10 out of the 1000 possible songs chosen uniformly at random."
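The validation protocol quoted above amounts to a random 70/30 split of the training indices; a minimal sketch (the seed and helper name are arbitrary, not from the paper):

```python
import numpy as np

def train_val_split(n_samples, val_fraction=0.3, seed=0):
    """Randomly hold out `val_fraction` of the training indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = int(round(val_fraction * n_samples))
    return perm[n_val:], perm[:n_val]  # train indices, validation indices

train_idx, val_idx = train_val_split(1000)
print(len(train_idx), len(val_idx))  # → 700 300
```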
Hardware Specification: No. Evidence: The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models; it only mentions general terms like "training" or "network".
Software Dependencies: No. Evidence: The paper mentions that the network is optimized using Adam (Kingma and Ba, 2017), but it does not specify version numbers for Adam or for any other software libraries or frameworks used in the experiments.
Experiment Setup: Yes. Evidence: "On this data, we train a linear classifier with L2-regularization using different basis loss functions... The network is optimized using Adam (Kingma and Ba, 2017) with an initial learning rate of 10^-4 for the first 15 epochs and 10^-5 for the remaining five epochs, with a mini-batch size of 512. We took 30% of the original training data and used them as validation data to determine the optimal value for the strength of L2-regularization."
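The learning-rate schedule in the quoted setup is a simple step function over 20 total epochs; a sketch (the framework-specific wiring into Adam is not given in the paper):

```python
def learning_rate(epoch: int) -> float:
    """Step schedule from the setup above: 1e-4 for the first
    15 epochs (0-14), then 1e-5 for the remaining five (15-19)."""
    return 1e-4 if epoch < 15 else 1e-5

print([learning_rate(e) for e in (0, 14, 15, 19)])
```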