Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Identifying and Benchmarking Natural Out-of-Context Prediction Problems
Authors: David Madras, Richard Zemel
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions. |
| Researcher Affiliation | Collaboration | David Madras, University of Toronto, Vector Institute; Richard Zemel, University of Toronto, Vector Institute, Columbia University |
| Pseudocode | No | The paper describes algorithms and methods in text, but it does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We present NOOCH (Naturally-Occurring Out-of-context Challenge sets), a suite of challenge sets for evaluating performance on naturally-arising OOC problems, available at https://github.com/dmadras/nooch; |
| Open Datasets | Yes | Background: COCO and COCO-Stuff. The Microsoft Common Objects in COntext dataset (COCO) [36] is a computer vision dataset... Fortunately, the COCO-Stuff dataset [7] provides labels... For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. |
| Dataset Splits | Yes | Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies or version numbers. |
| Experiment Setup | Yes | For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. We train binary classifiers to minimize average NLL on each of the 171 classes in COCO-Stuff. For the environment-based methods, we follow Sagawa et al. [54] and create 4 environments: 1 for each element of the cross-product of the label and its highest-α context class. Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
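The hyperparameter-selection rule quoted in the Experiment Setup row (pick the setting minimizing the maximum of the hard-positive and hard-negative validation losses) can be sketched as below. This is a minimal illustration, not the authors' code; the function and variable names (`select_hyperparameter`, `val_losses`) are hypothetical.

```python
def select_hyperparameter(val_losses):
    """Pick the hyperparameter value with the smallest worst-case
    validation loss over hard positives and hard negatives.

    val_losses: dict mapping hyperparameter value ->
                (hard_positive_loss, hard_negative_loss)
    """
    return min(val_losses, key=lambda h: max(val_losses[h]))

# Illustrative example with three candidate settings:
losses = {0.1: (0.9, 0.4), 1.0: (0.6, 0.5), 10.0: (0.3, 0.8)}
best = select_hyperparameter(losses)
print(best)  # 1.0, since its worst-case loss (0.6) is the smallest
```

The max-over-groups criterion mirrors the worst-group objective used by robust baselines such as group DRO [54], applied here at model-selection time.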