Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
Authors: Orion Weller, Ben Van Durme, Dawn Lawrie, Ashwin Paranjape, Yuhao Zhang, Jack Hessel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). |
| Researcher Affiliation | Collaboration | Johns Hopkins University; Samaya AI |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures, but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code and data are available at https://github.com/orionw/promptriever |
| Open Datasets | Yes | To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO (Nguyen et al., 2016)... We evaluate on in-domain (MS MARCO), out-of-domain (BEIR; Thakur et al., 2021), and instruction-following retrieval datasets, including InstructIR (Oh et al., 2024) and FollowIR (Weller et al., 2024). |
| Dataset Splits | Yes | We use the following settings for testing prompts, following the standards in the LM community; typically one would evaluate prompts for an LM by first using a small validation set. We sample 10 queries from each dataset's validation set (or train set, if no validation set exists) to use as the prompt tuning set... The metrics used are nDCG@10 for BEIR, TREC DL19 (Craswell et al., 2020) and DL20 (Soboroff, 2021), and MRR for MS MARCO Dev. |
| Hardware Specification | Yes | Training takes approximately 2 days on 8x40GB A100s for the ablation runs and 4 days for the full run. |
| Software Dependencies | Yes | We use the hyperparameters given by the authors of RepLLaMA on their GitHub page, using Tevatron (Gao et al., 2022). This uses meta-llama/Llama-2-7b-hf with LoRA rank 32, LoRA modules q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj, bfloat16 enabled, EOS pooling, normalization, a temperature of 0.01, a learning rate of 1e-4, one epoch, passage length 256, 100 warmup steps, a train group size of 16, and an effective batch size of 128 (4 GPUs, 8 per device, with 4 gradient accumulation steps). |
| Experiment Setup | Yes | We use the same learning rate and other hyperparameter details as the original RepLLaMA for a fair comparison (see Appendix E for more details). ... Appendix E: ... learning rate of 1e-4, one epoch, passage length 256, 100 warmup steps, a train group size of 16, and an effective batch size of 128 (4 GPUs, 8 per device, with 4 gradient accumulation steps). |
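The hyperparameters reported in the Software Dependencies and Experiment Setup rows can be collected into a single configuration sketch. This is a minimal, illustrative summary only: the key names below are not actual Tevatron CLI flags, just labels for the values the paper reports. The arithmetic check at the end confirms that the stated per-device settings multiply out to the reported effective batch size of 128.

```python
# Hedged sketch of the reported Promptriever/RepLLaMA training recipe.
# Key names are illustrative, not real Tevatron argument names.
hparams = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "lora_r": 32,
    "lora_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
    "dtype": "bfloat16",
    "pooling": "eos",
    "normalize_embeddings": True,
    "temperature": 0.01,
    "learning_rate": 1e-4,
    "epochs": 1,
    "passage_max_length": 256,
    "warmup_steps": 100,
    "train_group_size": 16,
    "num_gpus": 4,
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 4,
}

# Effective batch size = GPUs x per-device batch x accumulation steps.
effective_batch = (
    hparams["num_gpus"]
    * hparams["per_device_batch_size"]
    * hparams["gradient_accumulation_steps"]
)
print(effective_batch)  # 128, matching the reported effective batch size
```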
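The Dataset Splits row describes sampling 10 queries per dataset as a prompt tuning set, drawn from the validation split (or the train split when no validation split exists). A minimal sketch of that sampling step, with a hypothetical helper name and a fixed seed for reproducibility (the paper does not specify a seed):

```python
import random


def sample_prompt_tuning_set(queries, k=10, seed=0):
    """Sample a small prompt-tuning set, as in the paper's setup:
    10 queries per dataset, from validation (or train if no validation)."""
    rng = random.Random(seed)
    return rng.sample(queries, k=min(k, len(queries)))


# Hypothetical usage: pick 10 queries from a 100-query validation split.
val_queries = [f"query_{i}" for i in range(100)]
tuning_set = sample_prompt_tuning_set(val_queries)
print(len(tuning_set))  # 10
```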