Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models
Authors: Jianqun Zhou, Yuanlei Zheng, Wei Chen, Qianqian Zheng, Hui Su, Wei Zhang, Rui Meng, Xiaoyu Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance. |
| Researcher Affiliation | Collaboration | Jianqun Zhou1, Yuanlei Zheng4, Wei Chen4, Qianqian Zheng1, Hui Su2, Wei Zhang1, Rui Meng3, Xiaoyu Shen1,5 — 1Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo; 2Meituan Inc.; 3Salesforce Research; 4School of Software Engineering, Huazhong University of Science and Technology; 5Engineering Research Center of Chiplet Design and Manufacturing of Zhejiang Province. EMAIL, EMAIL |
| Pseudocode | No | The paper provides detailed descriptions of its metrics (SICR, WISE) with formulas and a 7-step construction process for the Info Search benchmark, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our dataset and code on https://github.com/EIT-NLP/InfoSearch |
| Open Datasets | Yes | We release our dataset and code on https://github.com/EIT-NLP/InfoSearch ... Table 4 (Structure and source of the dataset) maps each dimension to its source data: Audience: BioASQ, SciFact (Muennighoff et al., 2022); Keyword: MSMARCO (Bajaj et al., 2016); Language: publichealth-qa; Length: medical qa (Muennighoff et al., 2022) |
| Dataset Splits | No | The InfoSearch benchmark comprises 600 core queries, 1,598 instructed queries, 1,598 reversely instructed queries, and 6,392 documents. ... where \|Q\| represents the total number of queries in the test set. This formula calculates the percentage of retrievals that strictly adhere to the specified instructions relative to the total results. The paper describes the total number of queries and documents in the InfoSearch benchmark and refers to a 'test set' for evaluation, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts for each split) for its own dataset. |
| Hardware Specification | No | ACKNOWLEDGEMENT We thank EIT and IDT High Performance Computing Center for providing computational resources for this project. The paper mentions using a 'High Performance Computing Center' but does not provide any specific details about the hardware used, such as GPU/CPU models, memory, or cloud instance specifications. |
| Software Dependencies | No | The paper mentions various models and their underlying architectures (e.g., BERT, T5, LLMs) and implicitly references libraries such as PyTorch or Hugging Face through the models it cites, but it does not specify any particular software dependencies with version numbers. |
| Experiment Setup | Yes | For dense retrieval models, we compute the dot product between query and document vectors to determine retrieval rankings. For reranking models, the top 100 results from E5-mistral (Wang et al., 2023) are re-ranked based on the model's interpretation of the instruction. For general large language models, we use two settings: in the point-wise setting, both the query and document are inputs, with the output probabilities of "True" or "False" used as similarity scores; in the list-wise setting, following (Pradeep et al., 2023b), a list of documents is provided as a prompt (see Appendix C), and the model returns the ranked document IDs in a list. |
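The SICR formula quoted in the Dataset Splits row (compliant retrievals over |Q| total queries) can be sketched as below. This is a minimal sketch, assuming a per-query boolean compliance judgment; the exact criterion for "strictly adheres to the instruction" is defined in the paper, not in this report, so it is left abstract here.

```python
def sicr(compliance_flags):
    """Strict Instruction Compliance Ratio (sketch).

    compliance_flags: one boolean per query in the test set (|Q| entries),
    True if that query's retrieval strictly adhered to the instruction.
    Returns the fraction of strictly compliant retrievals.
    """
    if not compliance_flags:
        raise ValueError("empty test set")
    return sum(compliance_flags) / len(compliance_flags)
```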
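The two scoring schemes in the Experiment Setup row, dot-product ranking for dense retrievers and point-wise "True"/"False" probability scoring for general LLMs, can be sketched as follows. Function and argument names are illustrative assumptions, not the paper's code; the point-wise sketch takes raw token logits and applies a two-way softmax to obtain the probability of "True".

```python
import math
import numpy as np

def dense_rank(query_vec, doc_vecs):
    """Rank documents by dot product between query and document vectors."""
    scores = doc_vecs @ query_vec          # one score per document
    return np.argsort(-scores).tolist()    # best-first document indices

def pointwise_score(logit_true, logit_false):
    """Point-wise LLM setting (sketch): the softmax probability assigned
    to the 'True' token, given query and document as input, is used as
    the similarity score."""
    m = max(logit_true, logit_false)       # stabilize the exponentials
    e_t = math.exp(logit_true - m)
    e_f = math.exp(logit_false - m)
    return e_t / (e_t + e_f)
```

A higher `pointwise_score` means the model judged the document more relevant; documents can then be sorted by this score just as dense retrieval sorts by dot product.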