RankSHAP: Shapley Value Based Feature Attributions for Learning to Rank

Authors: Tanya Chowdhury, Yair Zick, James Allan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the RankSHAP framework through extensive experiments on two datasets, multiple ranking methods, and evaluation metrics. Additionally, a user study confirms RankSHAP's alignment with human intuition. We also perform an axiomatic analysis of existing rank attribution algorithms to determine their compliance with our proposed axioms. Ultimately, our aim is to equip practitioners with a set of axiomatically backed feature attribution methods for studying IR ranking models that ensure generality as well as consistency."
Researcher Affiliation | Academia | Tanya Chowdhury, Yair Zick, James Allan; Manning College of Information and Computer Sciences, University of Massachusetts Amherst; EMAIL
Pseudocode | No | "Below we write it for ranking attributions and name it Kernel-RankSHAP. Let $G$ be the class of all linear additive attributions. $\phi^R(f^R, x, i) = \arg\min_{g \in G} L(f^R, g, \pi_x)$, where $L(f^R, g, \pi_x) = \sum_{z' \in Z} \left[\mathrm{NDCG}(f^R(z')) - \mathrm{NDCG}(g(z'))\right]^2 \pi_x(z')$ and $\pi_x(z') = \dfrac{m-1}{\binom{m}{|z'|}\,|z'|\,(m-|z'|)}$."
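To make the Kernel-RankSHAP objective above concrete, here is a minimal sketch of the weighted-least-squares fit it describes. This is not the authors' released implementation: `value_fn` is a hypothetical callback standing in for "NDCG of the ranking produced under a feature coalition," and for readability the sketch enumerates all $2^m$ coalitions rather than using the 5,000-sample neighborhood approximation the paper reports.

```python
import itertools
import math
import numpy as np

def kernel_weight(m, s):
    """Kernel SHAP weighting pi_x(z') for a coalition of size s out of m features.
    The empty and full coalitions have unbounded weight; we flag them here and
    approximate the corresponding constraints with a large finite weight below."""
    if s == 0 or s == m:
        return None
    return (m - 1) / (math.comb(m, s) * s * (m - s))

def kernel_rankshap(value_fn, m):
    """Sketch of Kernel-RankSHAP: fit a weighted linear model
    g(z') = phi_0 + sum_i phi_i * z'_i to value_fn over all coalitions,
    where value_fn(mask) returns the (NDCG-based) value of keeping exactly
    the features with mask[i] == 1. Returns the attributions phi_1..phi_m."""
    masks, weights, values = [], [], []
    for bits in itertools.product([0, 1], repeat=m):
        w = kernel_weight(m, sum(bits))
        if w is None:
            w = 1e6  # approximately enforce the endpoint constraints
        masks.append(bits)
        weights.append(w)
        values.append(value_fn(np.array(bits)))
    # Design matrix with an intercept column, then weighted least squares:
    # phi = (Z^T W Z)^{-1} Z^T W v
    Z = np.hstack([np.ones((len(masks), 1)), np.array(masks, dtype=float)])
    W = np.diag(weights)
    phi = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ np.array(values))
    return phi[1:]  # drop the intercept phi_0
```

For an additive value function the regression is exact, which is a quick sanity check that the weighting and design matrix are wired up correctly.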
Open Source Code | No | "Ultimately, we advocate for practitioners to adopt RankSHAP-based, axiomatically grounded feature attribution methods as the reference standard for their IR explanation needs. Once the work is accepted, we plan to release our code as a Python library."
Open Datasets | Yes | "Dataset: We test our hypothesis on two datasets: (i) the MS MARCO (msm, 2016) passage reranking dataset; (ii) the TREC 2004 Robust track dataset (Robust04, (Voorhees et al., 2003)). MS MARCO is a large-scale dataset aggregated from anonymized Bing search queries containing > 8M passages from diverse text sources. The average length of a passage in the MS MARCO dataset is 1131... MS MARCO: Microsoft Machine Reading Comprehension. https://microsoft.github.io/msmarco/, 2016. Accessed: [3.5.2023]."
Dataset Splits | Yes | "Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest scoring documents for each query."
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions a planned "Python library" and various models and frameworks (BM25, BERT, T5, LLAMA2, Kernel SHAP, LIME, Neural NDCG) but does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | "Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest scoring documents for each query. Using these query-document sets, we apply the ranking models to obtain an ordered list of documents. We then generate feature attributions for the top-10, top-20, and top-100 documents. Stemmed tokens from the vocabulary of the query-document sets, treated as a bag of words, make up the features for which we generate attributions. Binary feature values are assigned based on their presence or absence, i.e., if a feature (token) is excluded from a coalition, all occurrences of that token are omitted from the query and documents for that pass of the model (Ribeiro et al., 2016)... Each algorithm is limited to 5,000 neighborhood samples per decision. For evaluation, we process only the top 7 most significant features produced by each algorithm, adhering to human comprehension limits (Chowdhury et al., 2023)."
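The coalition-masking step described in the setup (dropping every occurrence of an excluded token before rescoring) can be sketched as follows. This is an illustrative helper, not the paper's code: `apply_coalition` is a hypothetical name, and it assumes the text has already been tokenized and stemmed so that whitespace splitting recovers the vocabulary tokens.

```python
def apply_coalition(text, vocab, mask):
    """Remove every occurrence of tokens excluded from the coalition.

    text:  a query or document, pre-stemmed, whitespace-tokenizable
    vocab: ordered list of (stemmed) vocabulary tokens
    mask:  mask[i] == 1 keeps vocab[i]; 0 drops all its occurrences
    """
    dropped = {tok for tok, keep in zip(vocab, mask) if not keep}
    return " ".join(w for w in text.split() if w not in dropped)
```

A perturbed query-document pair produced this way is then rescored by the black-box ranker, and the resulting NDCG serves as that coalition's value in the attribution fit.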