Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

Authors: Raphael Lafargue, Luke A Smith, Franck VERMET, Matthias Löwe, Ian Reid, Jack Valmadre, Vincent Gripon

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again.
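The paired-test idea mentioned above can be sketched in a few lines; this is a minimal illustration (not the paper's code), assuming per-task accuracies for two methods evaluated on the same tasks. All function names here are illustrative.

```python
import math

def mean_ci95(values):
    """95% normal-approximation CI: returns (mean, 1.96 * s / sqrt(n))."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / (n - 1)  # sample variance
    return m, 1.96 * math.sqrt(s2 / n)

def paired_ci95(acc_a, acc_b):
    """CI on per-task accuracy differences A - B (paired comparison).

    Because both methods see the same tasks, shared task difficulty
    cancels in the differences, typically yielding a narrower interval
    than two independent per-method CIs.
    """
    return mean_ci95([a - b for a, b in zip(acc_a, acc_b)])
```

For example, if method A beats method B by a near-constant margin on every task, the paired interval collapses toward zero width even though each method's own accuracies vary widely across tasks.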
Researcher Affiliation Academia
- Raphael Lafargue (EMAIL): IMT Atlantique, Lab-STICC, UMR CNRS 6285, F-29238, France; Australian Institute for Machine Learning, University of Adelaide, Australia; CNRS, IRL CROSSING, Adelaide, Australia
- Luke Smith (EMAIL): Australian Institute for Machine Learning, University of Adelaide, Australia
- Franck Vermet (EMAIL): LBMA, CNRS, UMR 6205, Univ Brest, Brest, France
- Matthias Löwe (EMAIL): University of Münster, Germany
- Ian Reid (EMAIL): Australian Institute for Machine Learning, University of Adelaide, Australia; MBZUAI, Abu Dhabi, United Arab Emirates; CNRS, IRL CROSSING, Adelaide, Australia
- Vincent Gripon (EMAIL): IMT Atlantique, Lab-STICC, UMR CNRS 6285, F-29238, France
- Jack Valmadre (EMAIL): Australian Institute for Machine Learning, University of Adelaide, Australia; CNRS, IRL CROSSING, Adelaide, Australia
Pseudocode Yes Algorithm 1 Predominant evaluation algorithm
1: procedure Evaluate(T, K, S, Q, C, {X_c}_{c∈C})    ▷ T tasks, K ways, S shots, Q queries, set of classes C, set of data samples {X_c}_{c∈C}
2:   for t = 1, ..., T do
3:     K ← take(K, shuffle(C))
4:     for c ∈ K do
5:       S_c, Q_c ← split(S, take(S + Q, shuffle(X_c)))
6:     end for
7:     f_S ← FewShotLearn(S)    ▷ S := {S_c}_{c∈K}
8:     A_t ← (1/KQ) Σ_{c∈K} Σ_{x∈Q_c} 1[f_S(x) = c]
9:   end for
10:   Ā ← Mean(A)    ▷ Mean(A) := (1/T) Σ_{t=1}^{T} A_t
11:   Var(A) := (1/(T-1)) Σ_{t=1}^{T} (A_t - Ā)²
12:   σ_Ā ← σ_A / √T
13:   return Ā ± 1.96 σ_Ā    ▷ 1.96 = r(p_limit = 95%)
14: end procedure
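A runnable Python sketch of this evaluation loop, under the same structure as Algorithm 1. The names `few_shot_learn`, `classes`, and `samples` are illustrative placeholders, not the paper's API.

```python
import math
import random

def evaluate(T, K, S, Q, classes, samples, few_shot_learn):
    """Sketch of the predominant FSL evaluation loop (Algorithm 1).

    classes: list of class labels; samples: dict class -> list of examples;
    few_shot_learn: callable mapping a support dict {c: [examples]} to a
    classifier f(x) -> predicted class.
    """
    accs = []
    for _ in range(T):
        ways = random.sample(classes, K)              # K <- take(K, shuffle(C))
        support, query = {}, {}
        for c in ways:
            pool = random.sample(samples[c], S + Q)   # take(S+Q, shuffle(X_c))
            support[c], query[c] = pool[:S], pool[S:]
        f = few_shot_learn(support)                   # f_S <- FewShotLearn(S)
        correct = sum(f(x) == c for c in ways for x in query[c])
        accs.append(correct / (K * Q))                # A_t
    mean = sum(accs) / T
    var = sum((a - mean) ** 2 for a in accs) / (T - 1)
    half_width = 1.96 * math.sqrt(var / T)            # 1.96 * sigma_A / sqrt(T)
    return mean, half_width
```

Note that `random.sample` draws without replacement within a task, but nothing prevents the same class or example from reappearing across tasks, which is precisely the replacement issue the report discusses.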
Open Source Code Yes Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again.
Open Datasets Yes In our experiments, we utilize datasets from the Metadataset Benchmark as referenced in Triantafillou et al. (2019). This benchmark comprises 10 datasets, out of which we employ 9, excluding Imagenet, to focus on cross-domain results in line with the recent trend in the literature (Zhou et al., 2022b). These include Omniglot (handwritten characters), Aircraft, CUB (birds), DTD (textures), Fungi, VGG Flowers, Traffic Signs, Quickdraw (crowd-sourced drawings) and MSCOCO (common objects) (Lake et al., 2015; Maji et al., 2013; Wah et al., 2011; Cimpoi et al., 2014; Schroeder & Cui, 2018; Nilsback & Zisserman, 2008; Houben et al., 2013; Jongejan et al., 2016; Lin et al., 2014).
Dataset Splits Yes A few-shot classification task T = (K, S, Q) comprises a set of classes K, a support set S = {S_c}_{c∈K} and a query set Q = {Q_c}_{c∈K}, where S_c, Q_c denote the sets of support and query examples for each class c ∈ K. Let K = |K| denote the number of ways (i.e. classes in a few-shot task), S = |S_c| the number of shots per class, and Q = |Q_c| the number of queries per class (for simplicity, we assume the classes to be balanced). Few-shot evaluation is typically performed by constructing many tasks from a larger evaluation dataset.
Hardware Specification No This work was performed using HPC resources from GENCI IDRIS (Grant 202123656B). This does not specify any exact GPU/CPU models or memory.
Software Dependencies No The paper mentions models like CLIP (Radford et al., 2021) and DINO (Caron et al., 2021) and DINOv2, but it does not provide specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup Yes Luo et al. (2023) detail few-shot accuracies for 2000 tasks with 5 shots, 5 ways, and 15 queries in a comprehensive table covering various works on the Metadataset datasets. Our study's only difference lies in the adoption of the T = 600 setting, a more prevalent choice in the existing literature. If CCIs are found to be narrower than OCIs with this smaller T, the effect will be even starker with T = 2000 tasks, as shown in Equation 3.
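The T-dependence referenced above can be checked numerically, assuming the standard normal-approximation half-width z·σ/√T (the exact form of the paper's Equation 3 is not reproduced here): moving from T = 600 to T = 2000 tasks shrinks the interval by √(600/2000) ≈ 0.55, independently of the per-task standard deviation.

```python
import math

def ci_half_width(std, T, z=1.96):
    """Normal-approximation CI half-width: z * std / sqrt(T)."""
    return z * std / math.sqrt(T)

# The ratio of half-widths depends only on the task counts, not on std.
shrink = ci_half_width(0.1, 2000) / ci_half_width(0.1, 600)  # ~0.548
```

So any gap observed between CCIs and OCIs at T = 600 can only widen, in absolute terms, at T = 2000, since both intervals shrink by the same multiplicative factor.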