Atlas: Few-shot Learning with Retrieval Augmented Language Models

Authors: Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform evaluations on a wide range of tasks, including MMLU, KILT and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters. Keywords: retrieval augmented language models, information retrieval, language models
Researcher Affiliation | Collaboration | Gautier Izacard¹,², Patrick Lewis¹, Maria Lomeli¹, Lucas Hosseini¹, Fabio Petroni¹, Timo Schick¹, Jane Dwivedi-Yu¹, Armand Joulin¹, Sebastian Riedel¹,³, Edouard Grave¹ (¹ Meta AI, ² ENS, PSL University & Inria, ³ University College London)
Pseudocode | No | The paper describes methods and architectures using natural language and mathematical equations (e.g., in Section 2.2 Training Objectives for the Retriever) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code, pre-trained Atlas checkpoints, and various supporting data are available at https://github.com/facebookresearch/atlas
Open Datasets | Yes | To evaluate our retrieval-augmented language models we consider the following benchmarks, which include different tasks. 4.1.1 Knowledge-Intensive Language Tasks (KILT) First, we use the KILT evaluation suite (Petroni et al., 2020), containing 11 data sets corresponding to 5 tasks: fact checking, question answering, dialog generation, entity linking and slot-filling. ... MMLU (Hendrycks et al., 2021) ... MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), ARC (Clark et al., 2018) and OBQA (Mihaylov et al., 2018) ... TempLAMA (Dhingra et al., 2022)
Dataset Splits | Yes | More precisely, we decided to use 50 iterations for the 64-shot setting and 200 iterations in the 1024-shot setting. In both cases, we use a batch size of 32 examples... For the 5-shot setting, Atlas outperforms GPT-3 by 4%, while using 15× fewer parameters and 10× less pre-training compute... We select a subset of questions from this data set which have a different answer in 2017 and 2020, for example, Question: Theo Walcott plays for ___ Answer: Arsenal F.C. (2017), Everton F.C. (2020), and form a small training set of 248 training, 112 development and 806 test questions.
Hardware Specification | Yes | The index is stored at fp16 precision, resulting in a total GPU memory requirement of 49 GB and 587 GB for the Wikipedia and combined indices, respectively. This large GPU memory requirement for the index limits accessibility and ease of deployment. However, many index compression techniques are available for nearest neighbour search, which can often dramatically reduce memory requirements at the cost of some retrieval accuracy. Following Izacard et al. (2020), we explore the effect of Product Quantization (PQ, Jegou et al., 2010), a popular lossy compression technique, on Atlas-3B's accuracy for the 64-shot NQ task at different compression levels. The results are shown in Figure 4. We find that substantial compression is possible before the onset of significant performance degradation. Namely, the Wikipedia index can be compressed from 49GB to 4GB with negligible drop in retrieval precision and exact match. Likewise, the combined index can be compressed from 587GB to 50GB without serious degradation, indicating that the combined index could be loaded onto a single 80GB GPU.
Software Dependencies | No | The paper mentions models and architectures like 'BERT base architecture', 'T5 pre-trained weights', 'Contriever', and optimizers like 'AdamW', but it does not specify any software libraries or packages with version numbers for implementation.
Experiment Setup | Yes | We pre-train all our models for 10,000 iterations, using AdamW with a batch size of 64 and a learning rate of 10⁻⁴ for the reader and 10⁻⁵ for the retriever with linear decay and 1,000 warmup steps. We refresh the index every 1,000 steps... For the few-shot KILT ablation experiments, we perform a fixed number of fine-tuning iterations, instead of using early-stopping. More precisely, we decided to use 50 iterations for the 64-shot setting and 200 iterations in the 1024-shot setting. In both cases, we use a batch size of 32 examples, a learning rate of 4×10⁻⁵ with linear decay and 5 warmup steps for both the reader and the retriever.
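The Hardware Specification row above describes compressing the retrieval index with Product Quantization (PQ): each embedding is split into sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook. Below is a minimal NumPy sketch of that core idea, not the paper's FAISS-based implementation; the function names, `m`/`k` values, and k-means details are illustrative assumptions:

```python
import numpy as np

def pq_train(x, m=4, k=16, iters=10, seed=0):
    """Train PQ codebooks: split each d-dim vector into m sub-vectors
    and run a small k-means (k centroids) independently per subspace."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    ds = d // m  # dimensions per subspace
    codebooks = []
    for j in range(m):
        sub = x[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):
            # assign each sub-vector to its nearest centroid
            dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def pq_encode(x, codebooks):
    """Encode each vector as one small integer code per subspace."""
    m = len(codebooks)
    ds = x.shape[1] // m
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j, cent in enumerate(codebooks):
        sub = x[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct approximate vectors from their codes."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])
```

With this layout, a fp16 vector costing 2 bytes per dimension is replaced by `m` one-byte codes plus shared codebooks, which is the kind of lossy trade-off behind the quoted 49GB-to-4GB compression of the Wikipedia index.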
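The Experiment Setup row quotes a schedule of linear warmup (1,000 steps) followed by linear decay over 10,000 pre-training iterations, with peak learning rates of 10⁻⁴ (reader) and 10⁻⁵ (retriever). A minimal sketch of such a schedule follows; decaying all the way to zero at the final step is an assumption, since the quoted text does not state the decay endpoint:

```python
def linear_warmup_decay(step, total_steps=10_000, warmup=1_000, peak_lr=1e-4):
    """Learning rate at a given step: linear ramp from 0 to peak_lr over
    `warmup` steps, then linear decay from peak_lr toward 0 at total_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

In a PyTorch-style setup this function could be handed to a lambda-based scheduler and paired with AdamW, using `peak_lr=1e-5` for the retriever's parameter group.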