Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Retrieval meets Long Context Large Language Models
Authors: Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. |
| Researcher Affiliation | Industry | Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro — NVIDIA |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a link or explicit statement about releasing open-source code for its methodology. |
| Open Datasets | Yes | Specifically, we include four datasets from the validation set of the SCROLLS benchmark (Shaham et al., 2022). QMSum (QM) (Zhong et al., 2021) is a query-based summarization dataset... Qasper (QASP) (Dasigi et al., 2021) is a question answering dataset... NarrativeQA (NQA) (Kočiský et al., 2018) is an established question answering dataset... QuALITY (QLTY) (Pang et al., 2022) is a question answering dataset... We take another three datasets from LongBench (Bai et al., 2023). HotpotQA (HQA) (Yang et al., 2018) is a Wikipedia-based question-answer dataset... MuSiQue (MSQ) (Trivedi et al., 2022) is another multi-hop question answering dataset... MultiFieldQA-en (MFQA) (Bai et al., 2023) was manually curated... |
| Dataset Splits | Yes | Specifically, we include four datasets from the validation set of the SCROLLS benchmark (Shaham et al., 2022). |
| Hardware Specification | No | The paper mentions GPUs generally but does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100) or other hardware details used for their experiments. |
| Software Dependencies | No | The paper refers to specific models and techniques (e.g., RoPE embeddings, FlashAttention, Dragon, Contriever, OpenAI embedding) but does not list specific version numbers for any software libraries or dependencies used in their experiments. |
| Experiment Setup | Yes | We extend the 4K context window to 16K for GPT-43B. For Llama2, we extend its 4K context window to 32k for Llama2-7B and both 16K and 32K for Llama2-70B. We follow Chen et al. (2023) and finetune both LLMs on the Pile dataset (Gao et al., 2021) with batch size as 128, constant learning rate of 5e-6 to adapt the position embeddings. We finetune the LLM by taking the loss only on the {Answer} part with batch size 128 and learning rate of 5e-6 for 1000 steps. |
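The Experiment Setup cell reports the context-extension and finetuning recipe (4K window extended to 16K/32K via positional interpolation per Chen et al. (2023), batch size 128, constant learning rate 5e-6, 1000 steps). A minimal sketch of those reported values, assuming the standard linear position-interpolation formulation; the names below are illustrative, not the authors' actual code:

```python
# Reported finetuning hyperparameters, collected as a plain config dict.
# Values come from the paper's quoted setup; the dict itself is hypothetical.
FINETUNE_CONFIG = {
    "batch_size": 128,       # reported batch size
    "learning_rate": 5e-6,   # reported constant learning rate
    "steps": 1000,           # reported number of finetuning steps
}

def position_interpolation_scale(original_ctx: int, extended_ctx: int) -> float:
    """Linear position interpolation (Chen et al., 2023): position indices
    in the extended window are rescaled into the original trained range,
    i.e., each index is multiplied by original_ctx / extended_ctx."""
    return original_ctx / extended_ctx

# e.g., extending a 4K-token window to 16K compresses positions by 4x
print(position_interpolation_scale(4096, 16384))  # 0.25
```

With this scale applied to RoPE position indices, a model trained on 4K positions sees the 16K window mapped back into its familiar positional range, which is why only light finetuning (1000 steps here) is needed to adapt.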