FLUID: A Unified Evaluation Framework for Flexible Sequential Data
Authors: Matthew Wallingford, Aditya Kusupati, Keivan Alizadeh-Vahid, Aaron Walsman, Aniruddha Kembhavi, Ali Farhadi
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a broad set of methods which shed new insight on the advantages and limitations of current techniques and indicate new research problems to solve. As a starting point towards more general methods, we present two new baselines which outperform other evaluated methods on Fluid. |
| Researcher Affiliation | Collaboration | Matthew Wallingford EMAIL University of Washington Aditya Kusupati EMAIL University of Washington Keivan Alizadeh EMAIL University of Washington Aaron Walsman EMAIL University of Washington Aniruddha Kembhavi EMAIL Allen Institute for Artificial Intelligence Ali Farhadi EMAIL University of Washington |
| Pseudocode | Yes | Algorithm 1: FLUID Procedure. Input: task T; ML system: (pretrained) model f, update strategy S. Output: evaluations E, operation counter C. 1: function Fluid(T, (f, S)) 2: Evaluations E = [ ] 3: Datapoints D = [ ] 4: Operation Counter C = 0. |
| Open Source Code | No | The framework, data and models will be open-sourced. |
| Open Datasets | Yes | Data: In this paper, we evaluate methods with FLUID using a subset of ImageNet-22K (Deng et al., 2009). Traditionally, few-shot learning used datasets like Omniglot (Lake et al., 2011) & MiniImageNet (Vinyals et al., 2016) and continual learning focused on MNIST (LeCun, 1998) & CIFAR (Krizhevsky et al., 2009). ... We evaluate on the ImageNet-22K dataset to present new challenges to existing models. Recently, the iNaturalist (Van Horn et al., 2018; Wertheimer & Hariharan, 2019) and LVIS (Gupta et al., 2019) datasets have advocated for heavy-tailed distributions. |
| Dataset Splits | Yes | The dataset consists of a pretraining dataset and 5 different sequences of images for streaming (3 test and 2 validation sequences). For pretraining we use the standard ImageNet-1K (Russakovsky et al., 2015). ... Each test sequence contains images from 1000 different classes, 750 of which do not appear in ImageNet-1K. ... Each sequence contains 90,000 samples, where head classes contain > 50 and tail classes contain ≤ 50 samples. ... More comprehensive statistics on the data and sequences are in Appendix B. Table 4: Statistics for the sequences of images used in FLUID. Sequences 1-2 are for validation and Sequences 3-5 are for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or processor types used for running its experiments. It mentions 'ResNet18 and ResNet50 models', but these refer to model architectures, not hardware. |
| Software Dependencies | No | We use the PyTorch (Paszke et al., 2019) ResNet18 and ResNet50 models pretrained on supervised ImageNet-1K. |
| Experiment Setup | Yes | For all experiments (Table 3) that require offline training (fine-tuning, Weight Imprinting, standard training, ET and LwF), except OLTR, we train each model for 4 epochs every 5,000 samples observed. ... Fine-tuning experiments use a learning rate of 0.1 and standard training uses 0.01 for supervised pretraining. For MoCo pretraining, fine-tuning uses a learning rate of 30 and standard training uses 0.01. All the experiments use the SGD+Momentum optimizer with a 0.9 momentum. For Prototypical Networks and MAML we meta-train from scratch with the n-shot k-way paradigm. We use 5-shot 30-way in accordance with the original works (Snell et al., 2017; Finn et al., 2017). We meta-train for 100 epochs with a learning rate of 0.01 and reduce it by 0.5 every 40 epochs. |
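The truncated Algorithm 1 excerpt above (sequential evaluation with an operation counter) can be sketched as a minimal Python loop. This is an illustrative reading of the pseudocode, not the authors' released implementation; the `model_predict` and `update_strategy` callables and their signatures are assumptions introduced here for clarity.

```python
from typing import Callable, Iterable, List, Tuple


def fluid(stream: Iterable[Tuple[object, object]],
          model_predict: Callable[[object], object],
          update_strategy: Callable[[list], int]) -> Tuple[List[bool], int]:
    """Minimal sketch of the FLUID loop (Algorithm 1 in the paper).

    Hypothetical interfaces: `model_predict(x)` returns a predicted
    label; `update_strategy(datapoints)` may retrain the model on the
    data seen so far and returns the operations it spent doing so.
    """
    evaluations: List[bool] = []   # E: per-sample correctness
    datapoints: list = []          # D: observed (x, y) pairs
    op_counter = 0                 # C: cumulative compute spent

    for x, y in stream:
        # Score the model on each sample *before* it can learn from it,
        # so evaluation and learning happen on the same stream.
        evaluations.append(model_predict(x) == y)
        datapoints.append((x, y))
        # The update strategy decides when and how to learn, e.g. the
        # paper's baselines fine-tune every 5,000 samples observed.
        op_counter += update_strategy(datapoints)

    return evaluations, op_counter
```

A strategy that returns 0 most of the time and only spends operations at periodic retraining checkpoints would reproduce the "train 4 epochs every 5,000 samples" schedule described in the experiment-setup row.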