Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
Authors: Orion Weller, Ben Van Durme, Dawn Lawrie, Ashwin Paranjape, Yuhao Zhang, Jack Hessel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). |
| Researcher Affiliation | Collaboration | Johns Hopkins University; Samaya AI |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures, but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code and data are available at https://github.com/orionw/promptriever |
| Open Datasets | Yes | To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO (Nguyen et al., 2016)... We evaluate on in-domain (MS MARCO), out-of-domain (BEIR; Thakur et al., 2021), and instruction-following retrieval datasets, including InstructIR (Oh et al., 2024) and FollowIR (Weller et al., 2024). |
| Dataset Splits | Yes | We use the following settings for testing prompts, following the standards in the LM community; typically one would evaluate prompts for an LM by first using a small validation set. We sample 10 queries from each dataset's validation set (or train set, if no validation set exists) to use as the prompt tuning set... The metrics used are nDCG@10 for BEIR, TREC DL19 (Craswell et al., 2020) and DL20 (Soboroff, 2021), and MRR for MS MARCO Dev. |
| Hardware Specification | Yes | Training takes approximately 2 days on 8x40GB A100s for the ablation runs and 4 days for the full run. |
| Software Dependencies | Yes | We use the hyperparameters given by the authors of RepLLaMA on their GitHub page, using Tevatron (Gao et al., 2022). This uses meta-llama/Llama-2-7b-hf with LoRA rank 32, LoRA modules q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj, bfloat16 enabled, EOS pooling, normalization, a temperature of 0.01, a learning rate of 1e-4, one epoch, passage length 256, 100 warmup steps, a train group size of 16, and an effective batch size of 128 (4 GPUs, 8 per device, with 4 gradient accumulation steps). |
| Experiment Setup | Yes | We use the same learning rate and other hyperparameter details as the original RepLLaMA for a fair comparison (see Appendix E for more details). ... Appendix E: ... learning rate of 1e-4, one epoch, passage length 256, 100 warmup steps, a train group size of 16, and an effective batch size of 128 (4 GPUs, 8 per device, with 4 gradient accumulation steps). |
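The hyperparameters reported in the Software Dependencies and Experiment Setup rows can be collected into a single configuration sketch. This is a minimal, illustrative summary only: the key names below are not actual Tevatron CLI flags, just labels for the values the paper reports. The arithmetic check at the end confirms that the stated per-device settings multiply out to the reported effective batch size of 128.

```python
# Hedged sketch of the reported Promptriever/RepLLaMA training recipe.
# Key names are illustrative, not real Tevatron argument names.
hparams = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "lora_r": 32,
    "lora_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
    "dtype": "bfloat16",
    "pooling": "eos",
    "normalize_embeddings": True,
    "temperature": 0.01,
    "learning_rate": 1e-4,
    "epochs": 1,
    "passage_max_length": 256,
    "warmup_steps": 100,
    "train_group_size": 16,
    "num_gpus": 4,
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 4,
}

# Effective batch size = GPUs x per-device batch x accumulation steps.
effective_batch = (
    hparams["num_gpus"]
    * hparams["per_device_batch_size"]
    * hparams["gradient_accumulation_steps"]
)
print(effective_batch)  # 128, matching the reported effective batch size
```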
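The Dataset Splits row describes sampling 10 queries per dataset as a prompt tuning set, drawn from the validation split (or the train split when no validation split exists). A minimal sketch of that sampling step, with a hypothetical helper name and a fixed seed for reproducibility (the paper does not specify a seed):

```python
import random


def sample_prompt_tuning_set(queries, k=10, seed=0):
    """Sample a small prompt-tuning set, as in the paper's setup:
    10 queries per dataset, from validation (or train if no validation)."""
    rng = random.Random(seed)
    return rng.sample(queries, k=min(k, len(queries)))


# Hypothetical usage: pick 10 queries from a 100-query validation split.
val_queries = [f"query_{i}" for i in range(100)]
tuning_set = sample_prompt_tuning_set(val_queries)
print(len(tuning_set))  # 10
```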