RankSHAP: Shapley Value Based Feature Attributions for Learning to Rank
Authors: Tanya Chowdhury, Yair Zick, James Allan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the RankSHAP framework through extensive experiments on two datasets, multiple ranking methods, and evaluation metrics. Additionally, a user study confirms RankSHAP's alignment with human intuition. We also perform an axiomatic analysis of existing rank attribution algorithms to determine their compliance with our proposed axioms. Ultimately, our aim is to equip practitioners with a set of axiomatically backed feature attribution methods for studying IR ranking models that ensure generality as well as consistency. |
| Researcher Affiliation | Academia | Tanya Chowdhury, Yair Zick, James Allan; Manning College of Information and Computer Sciences, University of Massachusetts Amherst |
| Pseudocode | No | Below we write it for ranking attributions and name it Kernel-RankSHAP. Let G be the class of all linear additive attributions. φ^R(f^R, x, i) = argmin_{g∈G} L(f^R, g, π_x), where L(f^R, g, π_x) = Σ_{z'∈Z} [NDCG(f^R(z')) − NDCG(g(z'))]² · π_x(z') and π_x(z') = (m−1) / (C(m, \|z'\|) · \|z'\| · (m−\|z'\|)). |
| Open Source Code | No | Ultimately, we advocate for practitioners to adopt RankSHAP-based, axiomatically grounded feature attribution methods as the reference standard for their IR explanation needs. Once the work is accepted, we plan to release our code as a Python library. |
| Open Datasets | Yes | Dataset: We test our hypothesis on two datasets: (i) the MS MARCO (msm, 2016) passage reranking dataset, and (ii) the TREC 2004 Robust track dataset (Robust04, (Voorhees et al., 2003)). MS MARCO is a large-scale dataset aggregated from anonymized Bing search queries containing > 8M passages from diverse text sources. The average length of a passage in the MS MARCO dataset is 1131... MS MARCO: Microsoft Machine Reading Comprehension. https://microsoft.github.io/msmarco/, 2016. Accessed: [3.5.2023]. |
| Dataset Splits | Yes | Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest scoring documents for each query. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using a "python library" in the future and various models and frameworks (BM25, BERT, T5, LLAMA2, Kernel SHAP, LIME, Neural NDCG) but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest-scoring documents for each query. Using these query-document sets, we apply the ranking models to obtain an ordered list of documents. We then generate feature attributions for the top-10, top-20, and top-100 documents. Stemmed tokens from the vocabulary of the query-document sets, treated as a bag of words, make up the features for which we generate attributions. Binary feature values are assigned based on presence or absence; i.e., if a feature (token) is excluded from a coalition, all occurrences of that token are omitted from the query and documents for that pass of the model (Ribeiro et al., 2016)... Each algorithm is limited to 5,000 neighborhood samples per decision. For evaluation, we process only the top 7 most significant features produced by each algorithm, adhering to human comprehension limits (Chowdhury et al., 2023). |
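The Kernel-RankSHAP objective quoted above (a weighted least-squares fit of a linear surrogate to NDCG under token coalitions) can be sketched in a few lines. This is a minimal illustration, not the authors' unreleased implementation: the names `kernel_rankshap` and `ndcg_fn` are assumptions, and `ndcg_fn` stands in for re-running the ranker on a query-document set with the ablated tokens removed and computing NDCG of the resulting order.

```python
import math
import numpy as np

def kernel_weight(m, s):
    # Kernel SHAP weight pi_x(z') = (m-1) / (C(m, |z'|) * |z'| * (m - |z'|))
    return (m - 1) / (math.comb(m, s) * s * (m - s))

def kernel_rankshap(ndcg_fn, m, n_samples=5000, seed=0):
    """Approximate attributions for m token features.

    ndcg_fn: callable taking a binary vector z of length m (1 = token kept,
    0 = all occurrences of that token removed from query and documents) and
    returning the NDCG of the re-ranked list under that coalition.
    """
    rng = np.random.default_rng(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = rng.integers(0, 2, size=m)
        s = int(z.sum())
        if s == 0 or s == m:
            continue  # endpoint coalitions carry infinite weight; skipped in this sketch
        rows.append(z)
        targets.append(ndcg_fn(z))
        weights.append(kernel_weight(m, s))
    # Weighted least squares with a bias column: scale rows by sqrt(weight).
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    sw = np.sqrt(np.array(weights))
    phi, *_ = np.linalg.lstsq(X * sw[:, None], np.array(targets) * sw, rcond=None)
    return phi[1:]  # drop the bias term; phi[i] is the attribution of token i
```

In the paper's setup each call to `ndcg_fn` is one "neighborhood sample" (hence the 5,000-sample budget per decision), and the returned attributions would be sorted to pick the top 7 features for evaluation.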