Nearly Optimal Sample Complexity for Learning with Label Proportions

Authors: Robert Istvan Busa-Fekete, Travis Dick, Claudio Gentile, Haim Kaplan, Tomer Koren, Uri Stemmer

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we compare our proposed LLP loss to baseline losses on several datasets and models, following closely the setup of Busa-Fekete et al. (2023). ... Figure 1 shows the final test accuracy of each aggregate loss at various bag sizes and numbers of training epochs (some curves are not visible because they overlap one another)."
Researcher Affiliation | Collaboration | (1) Google Research, New York, USA; (2) Tel Aviv University, Israel. Correspondence to: Claudio Gentile <EMAIL>, Travis Dick <EMAIL>.
Pseudocode | Yes | "Algorithm 1: Algorithm for the two-function realizable case. ... Algorithm 2: SGD-based algorithm."
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it include a link to a code repository. Mentions of other companies' platforms (Apple SKAN, Google Chrome) are not related to the authors' own code release for the methodology described.
Open Datasets | Yes | "We conduct our experiments on the following datasets and models. We use versions of MNIST (Le Cun et al., 2010) and CIFAR-10 (Krizhevsky, 2009) with binary labels, together with the Higgs (Baldi et al., 2014) and UCI Adult (Kohavi & Becker, 1996) datasets, which are already binary tasks."
Dataset Splits | Yes | "We prepare each training dataset by shuffling the data, partitioning it into consecutive bags of size k, and replacing the labels within each bag by their average. ... We use the first 10,000 examples as test data and the remaining examples as training data. ... The data contains 32,561 training examples and 16,281 test examples."
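The bag-preparation step quoted above (shuffle, partition into consecutive bags of size k, replace labels with their bag average) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name `make_bags` and the seed handling are assumptions.

```python
import numpy as np

def make_bags(X, y, k, seed=0):
    """Shuffle (X, y), partition the data into consecutive bags of size k,
    and replace each bag's labels with their average (the label proportion).
    Leftover examples that do not fill a final bag are dropped."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    n_bags = len(X) // k
    bags = X[: n_bags * k].reshape(n_bags, k, -1)
    proportions = y[: n_bags * k].reshape(n_bags, k).mean(axis=1)
    return bags, proportions
```

With binary labels, each bag's proportion lies in {0, 1/k, 2/k, ..., 1}, which is the only label information the learner observes.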
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU/CPU models, memory, or other machine specifications used to run the experiments.
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization, but does not specify versions for any programming languages, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | "On all datasets we choose a batch size of N = 1024 and use bag sizes k = 2^i for i = 0, ..., 9. For each batch, we compute the gradient of the aggregate loss function and use Adam (Kingma & Ba, 2015) to update the model parameters. We repeat the above training procedure 10 times for each loss function, dataset, bag size, and Adam learning rate in the set {10^-6, 5·10^-6, 10^-5, 10^-4, 5·10^-4, 10^-3, 5·10^-3, 10^-2}."
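The sweep described in the experiment-setup row is a grid over bag sizes, learning rates, and 10 repeats per configuration. A minimal sketch of enumerating that grid follows; the `losses` and `datasets` lists are placeholders (the actual losses and datasets are those named in the paper), so the total run count below depends on those placeholder lengths.

```python
import itertools

bag_sizes = [2 ** i for i in range(10)]  # k = 1, 2, 4, ..., 512
learning_rates = [1e-6, 5e-6, 1e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2]
repeats = 10
batch_size = 1024  # N

# Placeholder names for illustration only.
losses = ["proposed_llp", "baseline_a", "baseline_b"]
datasets = ["mnist", "cifar10", "higgs", "adult"]

runs = [
    dict(loss=l, dataset=d, k=k, lr=lr, repeat=r)
    for l, d, k, lr, r in itertools.product(
        losses, datasets, bag_sizes, learning_rates, range(repeats)
    )
]
```

Enumerating the grid up front makes it easy to shard runs across workers and to verify that every (loss, dataset, k, lr) cell received all 10 repeats.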