RankSHAP: Shapley Value Based Feature Attributions for Learning to Rank

Authors: Tanya Chowdhury, Yair Zick, James Allan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the RankSHAP framework through extensive experiments on two datasets, multiple ranking methods, and evaluation metrics. Additionally, a user study confirms RankSHAP's alignment with human intuition. We also perform an axiomatic analysis of existing rank attribution algorithms to determine their compliance with our proposed axioms. Ultimately, our aim is to equip practitioners with a set of axiomatically backed feature attribution methods for studying IR ranking models that ensure generality as well as consistency."
Researcher Affiliation | Academia | Tanya Chowdhury, Yair Zick, James Allan; Manning College of Information and Computer Sciences, University of Massachusetts Amherst; EMAIL
Pseudocode | No | "Below we write it for ranking attributions and name it Kernel-RankSHAP. Let $G$ be the class of all linear additive attributions. $\phi^R(f^R, x, i) = \arg\min_{g \in G} L(f^R, g, \pi_x)$, where $L(f^R, g, \pi_x) = \sum_{z' \in Z} \left[\mathrm{NDCG}(f^R(z')) - \mathrm{NDCG}(g(z'))\right]^2 \pi_x(z')$ and $\pi_x(z') = \dfrac{m-1}{\binom{m}{|z'|}\,|z'|\,(m-|z'|)}$."
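To make the Kernel-RankSHAP objective above concrete, here is a minimal sketch of the weighted-least-squares fit it describes. This is not the authors' released implementation: `value_fn` is a hypothetical callback standing in for "NDCG of the ranking produced under a feature coalition," and for readability the sketch enumerates all $2^m$ coalitions rather than using the 5,000-sample neighborhood approximation the paper reports.

```python
import itertools
import math
import numpy as np

def kernel_weight(m, s):
    """Kernel SHAP weighting pi_x(z') for a coalition of size s out of m features.
    The empty and full coalitions have unbounded weight; we flag them here and
    approximate the corresponding constraints with a large finite weight below."""
    if s == 0 or s == m:
        return None
    return (m - 1) / (math.comb(m, s) * s * (m - s))

def kernel_rankshap(value_fn, m):
    """Sketch of Kernel-RankSHAP: fit a weighted linear model
    g(z') = phi_0 + sum_i phi_i * z'_i to value_fn over all coalitions,
    where value_fn(mask) returns the (NDCG-based) value of keeping exactly
    the features with mask[i] == 1. Returns the attributions phi_1..phi_m."""
    masks, weights, values = [], [], []
    for bits in itertools.product([0, 1], repeat=m):
        w = kernel_weight(m, sum(bits))
        if w is None:
            w = 1e6  # approximately enforce the endpoint constraints
        masks.append(bits)
        weights.append(w)
        values.append(value_fn(np.array(bits)))
    # Design matrix with an intercept column, then weighted least squares:
    # phi = (Z^T W Z)^{-1} Z^T W v
    Z = np.hstack([np.ones((len(masks), 1)), np.array(masks, dtype=float)])
    W = np.diag(weights)
    phi = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ np.array(values))
    return phi[1:]  # drop the intercept phi_0
```

For an additive value function the regression is exact, which is a quick sanity check that the weighting and design matrix are wired up correctly.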
Open Source Code | No | "Ultimately, we advocate for practitioners to adopt RankSHAP-based, axiomatically grounded feature attribution methods as the reference standard for their IR explanation needs. Once the work is accepted, we plan to release our code as a Python library."
Open Datasets | Yes | "Dataset: We test our hypothesis on two datasets: (i) the MS MARCO (msm, 2016) passage reranking dataset; (ii) the TREC 2004 Robust track dataset (Robust04, (Voorhees et al., 2003)). MS MARCO is a large-scale dataset aggregated from anonymized Bing search queries containing > 8M passages from diverse text sources. The average length of a passage in the MS MARCO dataset is 1131... MS MARCO: Microsoft Machine Reading Comprehension. https://microsoft.github.io/msmarco/, 2016. Accessed: [3.5.2023]."
Dataset Splits | Yes | "Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest scoring documents for each query."
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions a planned "Python library" and various models and frameworks (BM25, BERT, T5, LLAMA2, Kernel SHAP, LIME, Neural NDCG) but does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | "Similar to related works, we randomly sample 250 queries from the test sets of the MS MARCO and Robust04 datasets, retrieving the 100 highest scoring documents for each query. Using these query-document sets, we apply the ranking models to obtain an ordered list of documents. We then generate feature attributions for the top-10, top-20, and top-100 documents. Stemmed tokens from the vocabulary of the query-document sets, treated as a bag of words, make up the features for which we generate attributions. Binary feature values are assigned based on their presence or absence, i.e., if a feature (token) is excluded from a coalition, all occurrences of that token are omitted from the query and documents for that pass of the model (Ribeiro et al., 2016)... Each algorithm is limited to 5,000 neighborhood samples per decision. For evaluation, we process only the top 7 most significant features produced by each algorithm, adhering to human comprehension limits (Chowdhury et al., 2023)."
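The coalition-masking step described in the setup (dropping every occurrence of an excluded token before rescoring) can be sketched as follows. This is an illustrative helper, not the paper's code: `apply_coalition` is a hypothetical name, and it assumes the text has already been tokenized and stemmed so that whitespace splitting recovers the vocabulary tokens.

```python
def apply_coalition(text, vocab, mask):
    """Remove every occurrence of tokens excluded from the coalition.

    text:  a query or document, pre-stemmed, whitespace-tokenizable
    vocab: ordered list of (stemmed) vocabulary tokens
    mask:  mask[i] == 1 keeps vocab[i]; 0 drops all its occurrences
    """
    dropped = {tok for tok, keep in zip(vocab, mask) if not keep}
    return " ".join(w for w in text.split() if w not in dropped)
```

A perturbed query-document pair produced this way is then rescored by the black-box ranker, and the resulting NDCG serves as that coalition's value in the attribution fit.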