BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
Authors: Hongjin SU, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Liu Haisu, Quan Shi, Zachary Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan Arik, Danqi Chen, Tao Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard (Muennighoff et al., 2023), SFR-Embedding Mistral (Meng et al., 2024), which achieves an nDCG@10 of 59.0, scores only 18.3 nDCG@10 on BRIGHT. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points. |
| Researcher Affiliation | Collaboration | The University of Hong Kong; Princeton University; Stanford University; University of Washington; Google Cloud AI Research |
| Pseudocode | No | The paper describes various data collection processes (e.g., in Section 3.2, 3.3, 3.4) and experimental steps (e.g., Section 4.1), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like procedures. |
| Open Source Code | Yes | Our code and data are available at https://github.com/xlang-ai/BRIGHT and https://huggingface.co/datasets/xlangai/BRIGHT. To facilitate the reproduction of our experiments, the code and data are provided in https://brightbenchmark.github.io/. |
| Open Datasets | Yes | Our dataset consists of 1,384 real-world queries spanning diverse domains... Our code and data are available at https://github.com/xlang-ai/BRIGHT and https://huggingface.co/datasets/xlangai/BRIGHT. |
| Dataset Splits | Yes | We introduce BRIGHT, a retrieval benchmark that tests whether retrieval systems can match queries and documents whose relevance requires intensive reasoning to solve... We randomly sample 142 questions from this set to construct our test set. |
| Hardware Specification | Yes | We run all experiments on NVIDIA V100, A100, or H100 GPUs. |
| Software Dependencies | No | The paper mentions specific models like 'gensim13' (for BM25) and model checkpoints like 'all-mpnet-base-v2' or 'e5-mistral-7b-instruct'. It also mentions 'Flash Attention (Dao et al., 2022; Dao, 2024)' for speedup. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For training with the contrastive loss, we collect 3,200 (post, answer) pairs from the Biology, Earth Science, Economics, Psychology, Robotics, and Stack Overflow sections of Stack Exchange, and 1,538 pairs from Sustainable Living... We use a small batch size of 64 to ensure sufficient learning steps, while following the other hyperparameters as outlined in Muennighoff et al. (2024). We continue training GritLM for 10 epochs. |
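The table above reports retrieval quality as nDCG@10, the standard metric for ranked retrieval. As a reference point, here is a minimal, self-contained sketch of the metric (a generic implementation, not the paper's evaluation code, which uses its own harness): the DCG of the produced ranking is divided by the DCG of the ideal ranking, so a perfect ordering scores 1.0.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels in retrieved order: a perfect ranking scores 1.0,
# while pushing relevant documents lower in the list reduces the score.
perfect = [1, 1, 0, 0]
swapped = [0, 1, 1, 0]
```

Scores such as the 59.0 vs. 18.3 contrast in the table are this quantity averaged over all queries and reported on a 0-100 scale.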