Atlas: Few-shot Learning with Retrieval Augmented Language Models

Authors: Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform evaluations on a wide range of tasks, including MMLU, KILT and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters. Keywords: retrieval augmented language models, information retrieval, language models
Researcher Affiliation | Collaboration | Gautier Izacard¹,², Patrick Lewis¹, Maria Lomeli¹, Lucas Hosseini¹, Fabio Petroni¹, Timo Schick¹, Jane Dwivedi-Yu¹, Armand Joulin¹, Sebastian Riedel¹,³, Edouard Grave¹ (¹ Meta AI, ² ENS, PSL University & Inria, ³ University College London)
Pseudocode | No | The paper describes methods and architectures using natural language and mathematical equations (e.g., in Section 2.2 Training Objectives for the Retriever) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code, pre-trained Atlas checkpoints, and various supporting data are available at https://github.com/facebookresearch/atlas
Open Datasets | Yes | To evaluate our retrieval-augmented language models we consider the following benchmarks, which include different tasks. 4.1.1 Knowledge-Intensive Language Tasks (KILT) First, we use the KILT evaluation suite (Petroni et al., 2020), containing 11 data sets corresponding to 5 tasks: fact checking, question answering, dialog generation, entity linking and slot-filling. ... MMLU (Hendrycks et al., 2021) ... MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), ARC (Clark et al., 2018) and OBQA (Mihaylov et al., 2018) ... TempLAMA (Dhingra et al., 2022)
Dataset Splits | Yes | More precisely, we decided to use 50 iterations for the 64-shot setting and 200 iterations in the 1024-shot setting. In both cases, we use a batch size of 32 examples... For the 5-shot setting, Atlas outperforms GPT-3 by 4%, while using 15× fewer parameters and 10× less pre-training compute... We select a subset of questions from this data set which have a different answer in 2017 and 2020, for example, Question: Theo Walcott plays for ___ Answer: Arsenal F.C. (2017), Everton F.C. (2020), and form a small training set of 248 training, 112 development and 806 test questions.
Hardware Specification | Yes | The index is stored at fp16 precision, resulting in a total GPU memory requirement of 49 GB and 587 GB for the Wikipedia and combined indices, respectively. This large GPU memory requirement for the index limits accessibility and ease of deployment. However, many index compression techniques are available for nearest neighbour search, which can often dramatically reduce memory requirements at the cost of some retrieval accuracy. Following Izacard et al. (2020), we explore the effect of Product Quantization (PQ, Jegou et al., 2010), a popular lossy compression technique, on Atlas-3B's accuracy for the 64-shot NQ task at different compression levels. The results are shown in Figure 4. We find that substantial compression is possible before the onset of significant performance degradation. Namely, the Wikipedia index can be compressed from 49GB to 4GB with negligible drop in retrieval precision and exact match. Likewise, the combined index can be compressed from 587GB to 50GB without serious degradation, indicating that the combined index could be loaded onto a single 80GB GPU.
Software Dependencies | No | The paper mentions models and architectures like 'BERT base architecture', 'T5 pre-trained weights', 'Contriever', and optimizers like 'AdamW', but it does not specify any software libraries or packages with version numbers for implementation.
Experiment Setup | Yes | We pre-train all our models for 10,000 iterations, using AdamW with a batch size of 64 and a learning rate of 10⁻⁴ for the reader and 10⁻⁵ for the retriever with linear decay and 1,000 warmup steps. We refresh the index every 1,000 steps... For the few-shot KILT ablation experiments, we perform a fixed number of fine-tuning iterations, instead of using early-stopping. More precisely, we decided to use 50 iterations for the 64-shot setting and 200 iterations in the 1024-shot setting. In both cases, we use a batch size of 32 examples, a learning rate of 4×10⁻⁵ with linear decay and 5 warmup steps for both the reader and the retriever.
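The Hardware Specification row above describes compressing the retrieval index with Product Quantization (PQ): each embedding is split into sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook. Below is a minimal NumPy sketch of that core idea, not the paper's FAISS-based implementation; the function names, `m`/`k` values, and k-means details are illustrative assumptions:

```python
import numpy as np

def pq_train(x, m=4, k=16, iters=10, seed=0):
    """Train PQ codebooks: split each d-dim vector into m sub-vectors
    and run a small k-means (k centroids) independently per subspace."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    ds = d // m  # dimensions per subspace
    codebooks = []
    for j in range(m):
        sub = x[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):
            # assign each sub-vector to its nearest centroid
            dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def pq_encode(x, codebooks):
    """Encode each vector as one small integer code per subspace."""
    m = len(codebooks)
    ds = x.shape[1] // m
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j, cent in enumerate(codebooks):
        sub = x[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct approximate vectors from their codes."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])
```

With this layout, a fp16 vector costing 2 bytes per dimension is replaced by `m` one-byte codes plus shared codebooks, which is the kind of lossy trade-off behind the quoted 49GB-to-4GB compression of the Wikipedia index.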
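The Experiment Setup row quotes a schedule of linear warmup (1,000 steps) followed by linear decay over 10,000 pre-training iterations, with peak learning rates of 10⁻⁴ (reader) and 10⁻⁵ (retriever). A minimal sketch of such a schedule follows; decaying all the way to zero at the final step is an assumption, since the quoted text does not state the decay endpoint:

```python
def linear_warmup_decay(step, total_steps=10_000, warmup=1_000, peak_lr=1e-4):
    """Learning rate at a given step: linear ramp from 0 to peak_lr over
    `warmup` steps, then linear decay from peak_lr toward 0 at total_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

In a PyTorch-style setup this function could be handed to a lambda-based scheduler and paired with AdamW, using `peak_lr=1e-5` for the retriever's parameter group.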