On Local Overfitting and Forgetting in Deep Neural Networks

Authors: Uri Stern, Tomer Yaacoby, Daphna Weinshall

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations demonstrate the efficacy of our method across multiple datasets, contemporary neural network architectures, and training protocols. In Section 6 we describe the empirical validation of our method in a series of experiments over image classification datasets with and without label noise, using various network architectures, including in particular modern networks over Imagenet.
Researcher Affiliation | Academia | Uri Stern, Tomer Yaacoby and Daphna Weinshall, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 91904, Israel, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Knowledge Fusion (KF). Input: checkpoints of trained model {n_0, ..., n_E}, window w, test point x. Output: prediction for x. Steps: {A_1, ..., A_k}, {ε_1, ..., ε_k} ← calc_early_forget({n_0, ..., n_E}); prob ← get_class_probs[E]; for i = 1 to k do: prob_A ← mean(get_class_probs[A_i − w : A_i + w]); prob ← ε_i · prob_A + (1 − ε_i) · prob; end for; prediction ← argmax(prob); return prediction.
Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository. It refers to "Full implementation details are provided in App. E." and to the "complete archived version of this paper (Stern, Yaacoby, and Weinshall 2024)" for appendices, but this does not confirm the release of source code.
Open Datasets | Yes | We use various image classification datasets, neural network architectures, and training schemes. The main results are presented in Tables 1-3, followed by a brief review of our extensive ablation study and additional comparisons in Section 6.2. All references to appendices below are to be found in the complete archived version of this paper (Stern, Yaacoby, and Weinshall 2024). Specifically, in Table 1 we report results while using multiple architectures trained on CIFAR-100, Tiny Imagenet and Imagenet, with different learning rate schedulers and optimizers. For comparison, we report the results of both the original predictor and some baselines. Additional results for scenarios connected to overfitting are shown in Table 2 and App. F, where we test our method on these datasets with injected symmetric and asymmetric label noise (see App. E), as well as on a real label noise dataset (Animal10N).
Dataset Splits | Yes | In each experiment we use half of the test data for validation, to compute our method's hyper-parameters (the list of alternative epochs and {ε_i}), and then test the result on the remaining test data. The accuracy reported here is only on the remaining test data, averaged over three random splits of validation and test data, using different random seeds. In App. G.1 we report results on the original train/test split, where a subset of the training data is set aside for hyper-parameter tuning.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. It mentions various neural network architectures (ResNet, ConvNeXt, ViT, MaxViT) and optimizers (SGD, AdamW), but no hardware details.
Software Dependencies | No | The paper does not specify any software names with version numbers. While it implicitly uses deep learning frameworks, no versions for libraries like Python, PyTorch, or CUDA are provided.
Experiment Setup | Yes | In Fig. 3a we report the results, showing that all networks forget some portion of the data during training as in the label noise scenario, even if the test accuracy never decreases. In Section 6.1, the paper mentions using "various image classification datasets, neural network architectures, and training schemes" with "different learning rate schedulers and optimizers (SGD, AdamW)". It also states: "In each experiment we use half of the test data for validation, to compute our method's hyper-parameters (the list of alternative epochs and {ε_i})" and "In our experiments, we use a fixed window w = 1". Furthermore, "Full implementation details are provided in App. E."
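The Knowledge Fusion pseudocode quoted in the Pseudocode row above can be turned into a short sketch. Since the paper does not release source code, the function below is an illustrative reconstruction under stated assumptions: the function name, the per-checkpoint probability array, and the inclusive window slicing are assumptions, and the early-forget analysis that produces the alternative epochs {A_i} and weights {ε_i} is not re-implemented here.

```python
import numpy as np

def knowledge_fusion(probs_per_epoch, alt_epochs, eps, w=1):
    """Sketch of the KF prediction rule (Algorithm 1) for one test point x.

    probs_per_epoch : array-like of shape (E+1, num_classes); the class
        probabilities that checkpoints n_0, ..., n_E assign to x.
    alt_epochs : alternative epochs {A_1, ..., A_k}, assumed to come from
        the paper's early-forget analysis (calc_early_forget), not shown.
    eps : mixing weights {eps_1, ..., eps_k}, one per alternative epoch.
    w : window half-width around each alternative epoch (the paper fixes w = 1).
    """
    probs_per_epoch = np.asarray(probs_per_epoch, dtype=float)
    # start from the prediction of the final checkpoint n_E
    prob = probs_per_epoch[-1]
    for a, e in zip(alt_epochs, eps):
        # average class probabilities over the window [A_i - w, A_i + w],
        # assumed inclusive on both ends
        prob_a = probs_per_epoch[a - w : a + w + 1].mean(axis=0)
        # fuse the window average with the running prediction
        prob = e * prob_a + (1 - e) * prob
    return int(np.argmax(prob))
```

With ε_i = 0 for all i the rule reduces to the final checkpoint's prediction, which matches the pseudocode's initialization of prob from get_class_probs[E].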