AdaFlood: Adaptive Flood Regularization

Authors: Wonho Bae, Yi Ren, Mohamed Osama Ahmed, Frederick Tung, Danica J. Sutherland, Gabriel L. Oliveira

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments (Section 4) demonstrate that AdaFlood generally outperforms previous flood methods on a variety of tasks, including image and text classification, probability density estimation for asynchronous event sequences, and regression for tabular datasets."
Researcher Affiliation | Collaboration | Wonho Bae (University of British Columbia), Yi Ren (University of British Columbia), Mohamed Osama Ahmed (Borealis AI), Frederick Tung (Borealis AI), Danica J. Sutherland (University of British Columbia & Amii), Gabriel L. Oliveira (Borealis AI)
Pseudocode | Yes | Algorithm 1: Training of Auxiliary Network(s) and AdaFlood
1: Train a single auxiliary network f^aux on the entire training set D   (fine-tuning method only)
2: for D_aux,i in {D_aux,i}_{i=1}^{n} do
3:   Train f^aux_i, either from scratch or by fine-tuning f^aux, on D \ D_aux,i
4:   Save the adaptive flood level θ_i for each x_i ∈ D_aux,i using f^aux_i
5: end for
6: Train the main model f using Equation (3) and the adaptive flood levels θ computed above
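The per-sample objective used in step 6 can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the paper's Equation (3) takes the usual flood form |ℓ_i − θ_i| + θ_i averaged over the batch, with a per-sample flood level θ_i instead of a single global one; the function name is hypothetical.

```python
def adaflood_loss(per_sample_losses, flood_levels):
    """Flood-style regularization with a per-sample flood level theta_i.

    When a sample's loss drops below its flood level, the absolute-value
    term flips sign, so gradient descent pushes that loss back up toward
    theta_i instead of driving it to zero.
    """
    assert len(per_sample_losses) == len(flood_levels)
    n = len(per_sample_losses)
    return sum(abs(l - t) + t
               for l, t in zip(per_sample_losses, flood_levels)) / n
```

For instance, a sample already below its flood level (loss 0.2, θ = 0.5) contributes |0.2 − 0.5| + 0.5 = 0.8, the same as a sample at loss 0.8 above that level: both are pulled toward θ rather than zero.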
Open Source Code | No | From the Reproducibility statement: "For each experiment, we listed implementation details such as model, regularization, and search space for hyperparameters. We also specified the datasets we used for each experiment, and how they were split and augmented, along with a description of the metrics. The code is released with the final version."
Open Datasets | Yes | "We use two popular benchmark datasets, Stack Overflow (predicting the times at which users receive badges) and Reddit (predicting posting times). Following Bae et al. (2023), we also benchmark our method on a dataset with stronger periodic patterns: Uber (predicting pick-up times)... We use SVHN (Netzer et al., 2011), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009) for image classification... We also use the tabular datasets Brazilian Houses and Wine Quality from OpenML (Vanschoren et al., 2013)... We further employ Stanford Sentiment Treebank (SST-2)... We use ImageNet100 (Tian et al., 2020) for image classification... We use the NYC Taxi Tip dataset from OpenML (Vanschoren et al., 2013)..."
Dataset Splits | Yes | "We split each training dataset into train (80%) and validation (20%) sets. Details are provided in Appendix A. ...we split each training dataset into train (80%) and validation (20%) sets for hyperparameter search; thus our numbers are generally somewhat worse than what they reported, as we do not directly tune on the test set. ...We typically use five-fold cross-validation as a reasonable trade-off between computational expense and good-enough models to estimate θi"
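The five-fold scheme for estimating θ_i can be sketched as below: each auxiliary network is trained on all folds but one and then scores only the held-out fold, so every training point receives a flood level from a model that never saw it. This is an illustrative sketch under that reading, not the authors' code; the helper name is hypothetical, and `n_folds=5` follows the paper's stated choice.

```python
def heldout_folds(n_samples, n_folds=5):
    """Yield (train_indices, heldout_indices) pairs.

    The auxiliary network f^aux_i is trained on train_indices and used
    only to compute flood levels theta for heldout_indices.
    """
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, held_out
```

Each sample appears in exactly one held-out fold, so the adaptive flood levels cover the whole training set while remaining out-of-sample estimates.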
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or processor types) are provided for running the experiments. The paper discusses computational costs but not the hardware used.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow versions) are explicitly mentioned in the paper.
Experiment Setup | Yes | "For each dataset, we conduct hyperparameter tuning for the learning rate and the weight for L2 regularization with the unregularized baseline (we still apply early stopping and L2 regularization by default). Once the learning rate and weight-decay parameters are fixed, we search for the optimal flood levels. The optimal flood levels are selected via a grid search on {−50, −45, −40, …, 0, 5} ∪ {−4, −3, …, 3, 4} for Flood and iFlood, and the optimal γ on {0.0, 0.1, …, 0.9} for AdaFlood using the validation set."
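The tuning protocol above amounts to a plain validation-set grid search over one hyperparameter at a time. A minimal sketch, assuming a callback that trains and evaluates with a given setting (the `val_loss` helper and function name are hypothetical, not from the paper):

```python
def grid_search(candidates, val_loss):
    """Return the candidate that minimizes validation loss.

    candidates: iterable of hyperparameter values, e.g. flood levels for
    Flood/iFlood or gamma values for AdaFlood.
    val_loss: callable mapping a candidate to its validation loss
    (i.e. it trains and evaluates a model with that setting).
    """
    return min(candidates, key=val_loss)

# The gamma grid for AdaFlood described in the setup above:
gammas = [round(0.1 * k, 1) for k in range(10)]  # 0.0, 0.1, ..., 0.9
```

Because the learning rate and weight decay are fixed first, each flood-level or γ candidate requires only one additional training run, keeping the search cost linear in the grid size.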