Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Identifying and Benchmarking Natural Out-of-Context Prediction Problems
Authors: David Madras, Richard Zemel
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions. |
| Researcher Affiliation | Collaboration | David Madras, University of Toronto, Vector Institute; Richard Zemel, University of Toronto, Vector Institute, Columbia University |
| Pseudocode | No | The paper describes algorithms and methods in text, but it does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We present NOOCH (Naturally-Occurring Out-of-context Challenge sets), a suite of challenge sets for evaluating performance on naturally-arising OOC problems, available at https://github.com/dmadras/nooch; |
| Open Datasets | Yes | Background: COCO and COCO-Stuff. The Microsoft Common Objects in COntext dataset (COCO) [36] is a computer vision dataset... Fortunately, the COCO-Stuff dataset [7] provides labels... For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. |
| Dataset Splits | Yes | Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies or version numbers. |
| Experiment Setup | Yes | For all experiments we use a ResNet-50 [23], finetuned from ImageNet-pretrained features [53]. We train binary classifiers to minimize average NLL on each of the 171 classes in COCO-Stuff. For the environment-based methods, we follow Sagawa et al. [54] and create 4 environments: 1 for each element of the cross-product of the label and its highest-α context class. Many of the robust baselines from Sec. 4 come with a hyperparameter which aims to trade off between average performance and OOC performance; we choose the hyperparameter which minimizes the maximum loss of hard positives and hard negatives on the validation set. |
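The hyperparameter-selection rule quoted in the Experiment Setup row (pick the setting minimizing the maximum of the hard-positive and hard-negative validation losses) can be sketched as below. This is a minimal illustration, not the authors' code; the function and variable names (`select_hyperparameter`, `val_losses`) are hypothetical.

```python
def select_hyperparameter(val_losses):
    """Pick the hyperparameter value with the smallest worst-case
    validation loss over hard positives and hard negatives.

    val_losses: dict mapping hyperparameter value ->
                (hard_positive_loss, hard_negative_loss)
    """
    return min(val_losses, key=lambda h: max(val_losses[h]))

# Illustrative example with three candidate settings:
losses = {0.1: (0.9, 0.4), 1.0: (0.6, 0.5), 10.0: (0.3, 0.8)}
best = select_hyperparameter(losses)
print(best)  # 1.0, since its worst-case loss (0.6) is the smallest
```

The max-over-groups criterion mirrors the worst-group objective used by robust baselines such as group DRO [54], applied here at model-selection time.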