Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

Authors: Shijie Chen, Bernal Jimenez Gutierrez, Yu Su

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice.
Researcher Affiliation | Academia | Shijie Chen, Bernal Jimenez Gutierrez, Yu Su (The Ohio State University)
Pseudocode | No | The paper describes methods using mathematical equations and structured descriptions, but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and data: https://github.com/OSU-NLP-Group/In-Context-Reranking.
Open Datasets | Yes | We evaluate our method on both single-hop information retrieval benchmarks, which emphasize estimating the documents' semantic relevance, and multi-hop retrieval tasks. In single-hop evaluations, we experiment on TREC (Craswell et al., 2020) and nine public datasets in the BEIR benchmark (Thakur et al., 2021), including TREC-COVID (Voorhees et al., 2021), NFCorpus (Boteva et al., 2016), DBPedia-Entity (Hasibi et al., 2017), SciFact (Wadden et al., 2020), SciDocs (Cohan et al., 2020), FiQA (Maia et al., 2018), FEVER (Thorne et al., 2018), Climate-FEVER (Diggelmann et al., 2020), and NQ (Kwiatkowski et al., 2019). In multi-hop evaluations, we sample 1000 queries from three popular multi-hop question answering datasets as in Gutierrez et al. (2024), including MuSiQue (answerable) (Trivedi et al., 2022), 2WikiMultiHopQA (Ho et al., 2020), and HotpotQA (Yang et al., 2018).
Dataset Splits | Yes | Following Sun et al. (2023), we re-rank the top 100 documents returned by BM25 (Robertson & Zaragoza, 2009) and report nDCG@10. Following Gutierrez et al. (2024), we re-rank the top 20 retrieval results returned by ColBERTv2 (Santhanam et al., 2022) and measure recall@2 and recall@5.
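For reference, the two evaluation metrics quoted above (nDCG@10 for single-hop re-ranking, recall@k for multi-hop) can be computed with a short, self-contained sketch. The function names and toy inputs below are illustrative, not taken from the paper's codebase.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold documents retrieved within the top-k ranking."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)
```

A perfectly ordered ranking gives `ndcg_at_k([3, 2, 1], 10) == 1.0`, while pushing the only relevant document to the bottom lowers the score.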
Hardware Specification | Yes | This test is performed on a single NVIDIA RTX A6000 Ada GPU.
Software Dependencies | No | We implement ICR with the Transformers library (Wolf et al., 2020) by extracting attention weights from the forward() function. We implement RankGPT using vLLM (Kwon et al., 2023), a popular LLM inference infrastructure that is much faster than Transformers, to ensure a practical comparison. The paper mentions software libraries like Transformers and vLLM along with citations, but does not provide specific version numbers for these components.
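The attention-based scoring this row refers to can be illustrated with a minimal sketch. The aggregation below (mean attention mass from query tokens to each document's tokens, averaged over layers and heads) is a simplification of ICR, and the random tensors merely mimic the shapes that `output_attentions=True` returns in Hugging Face Transformers; the document/query spans are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the per-layer attention tensors a Transformers model returns
# with output_attentions=True: a list of arrays shaped
# (batch, heads, seq_len, seq_len). Real ICR reads these from forward().
num_layers, num_heads, seq_len = 4, 8, 30
attentions = [rng.random((1, num_heads, seq_len, seq_len)) for _ in range(num_layers)]

# Hypothetical token spans: three documents, then the query at the end of the
# prompt (the paper places the query last due to autoregressive decoding).
doc_spans = {"doc_a": range(0, 10), "doc_b": range(10, 20), "doc_c": range(20, 25)}
query_span = range(25, 30)

def document_scores(attentions, doc_spans, query_span):
    """Score each document by the attention mass flowing from query tokens
    to that document's tokens, averaged over all layers and heads."""
    att = np.stack([a[0] for a in attentions])  # (layers, heads, seq, seq)
    scores = {}
    for doc, span in doc_spans.items():
        # Attention from every query token to every token of this document.
        mass = att[:, :, list(query_span), :][:, :, :, list(span)]
        scores[doc] = float(mass.mean())
    return scores

scores = document_scores(attentions, doc_spans, query_span)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With a real model, `attentions` would come from `model(input_ids, output_attentions=True).attentions`; the aggregation logic is unchanged.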
Experiment Setup | Yes | To leverage the instruction-following abilities in modern LLMs, we formulate the re-ranking task into either question answering (QA) or information extraction (IE) tasks, which are more commonly seen by LLMs. We design two kinds of instruction prompts for different types of queries: QA style: For search queries that are questions, we prompt the LLM to answer the question based on provided documents, leveraging its QA abilities. IE style: For non-question queries, we prompt the LLM to find relevant information from provided documents, leveraging its IE abilities. The prompt consists of the instruction, the documents to be re-ranked, and the search query. We reverse the order of input documents as we find it improves performance in our preliminary experiments, probably due to the position bias in LLMs. Given the autoregressive nature of decoder-only LLMs, we place the query at the end of the prompt. Base LLMs. As in-context re-ranking requires access to the attention distribution of all layers and heads in an LLM, we experiment with open-weight LLMs. We choose two popular instruction-tuned models: Mistral 7B (Jiang et al., 2023) and Llama-3.1 8B (Dubey et al., 2024).
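The prompt layout described in this row (instruction, then documents in reversed order, then the query last) can be sketched as follows. The instruction wording and helper name are illustrative assumptions, not the paper's exact prompts.

```python
def build_rerank_prompt(query, documents, style="qa"):
    """Assemble a re-ranking prompt: instruction first, documents in reversed
    order (per the paper's position-bias finding), and the query last.
    The instruction strings below are placeholders, not the paper's wording."""
    if style == "qa":
        instruction = "Answer the question based on the provided documents."
    else:  # "ie" style, for non-question search queries
        instruction = "Find information relevant to the query in the provided documents."
    # Reverse presentation order while keeping the original document numbering.
    body = "\n".join(
        f"Document {i + 1}: {doc}"
        for i, doc in reversed(list(enumerate(documents)))
    )
    return f"{instruction}\n\n{body}\n\nQuery: {query}"

prompt = build_rerank_prompt(
    "Who wrote Hamlet?",
    ["Doc about Shakespeare.", "Doc about Moliere."],
)
```

Note that "Document 2" appears before "Document 1" in the output, and the query is the final line, matching the layout the paper describes for decoder-only LLMs.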