BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
Authors: Hongjin SU, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Liu Haisu, Quan Shi, Zachary Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan Arik, Danqi Chen, Tao Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard (Muennighoff et al., 2023), SFR-Embedding Mistral (Meng et al., 2024), which achieves an nDCG@10 of 59.0, scores only 18.3 nDCG@10 on BRIGHT. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points. |
| Researcher Affiliation | Collaboration | The University of Hong Kong; Princeton University; Stanford University; University of Washington; Google Cloud AI Research |
| Pseudocode | No | The paper describes various data collection processes (e.g., in Section 3.2, 3.3, 3.4) and experimental steps (e.g., Section 4.1), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like procedures. |
| Open Source Code | Yes | Our code and data are available at https://github.com/xlang-ai/BRIGHT and https://huggingface.co/datasets/xlangai/BRIGHT. To facilitate the reproduction of our experiments, the code and data are provided in https://brightbenchmark.github.io/. |
| Open Datasets | Yes | Our dataset consists of 1,384 real-world queries spanning diverse domains... Our code and data are available at https://github.com/xlang-ai/BRIGHT and https://huggingface.co/datasets/xlangai/BRIGHT. |
| Dataset Splits | Yes | We introduce BRIGHT, a retrieval benchmark that tests whether retrieval systems can match queries and documents whose relevance requires intensive reasoning to solve... We randomly sample 142 questions from this set to construct our test set. |
| Hardware Specification | Yes | We run all experiments on NVIDIA V100, A100, or H100 GPUs. |
| Software Dependencies | No | The paper mentions specific models like 'gensim13' (for BM25) and model checkpoints like 'all-mpnet-base-v2' or 'e5-mistral-7b-instruct'. It also mentions 'Flash Attention (Dao et al., 2022; Dao, 2024)' for speedup. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For training with the contrastive loss, we collect 3,200 (post, answer) pairs from the Biology, Earth Science, Economics, Psychology, Robotics, and Stack Overflow sections of Stack Exchange, and 1,538 pairs from Sustainable Living... We use a small batch size of 64 to ensure sufficient learning steps, while following the other hyperparameters as outlined in Muennighoff et al. (2024). We continue training GritLM for 10 epochs. |
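The table above reports retrieval quality as nDCG@10, the standard metric for ranked retrieval. As a reference point, here is a minimal, self-contained sketch of the metric (a generic implementation, not the paper's evaluation code, which uses its own harness): the DCG of the produced ranking is divided by the DCG of the ideal ranking, so a perfect ordering scores 1.0.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels in retrieved order: a perfect ranking scores 1.0,
# while pushing relevant documents lower in the list reduces the score.
perfect = [1, 1, 0, 0]
swapped = [0, 1, 1, 0]
```

Scores such as the 59.0 vs. 18.3 contrast in the table are this quantity averaged over all queries and reported on a 0-100 scale.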