Assisting Human Decisions in Document Matching

Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B. Shah, Ameet Talwalkar

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a crowdsourced study (N = 271 participants), we find that providing black-box model explanations reduces users' accuracy on the matching task, contrary to the commonly held belief that they can be helpful by allowing a better understanding of the model.
Researcher Affiliation | Academia | Joon Sik Kim (EMAIL), Carnegie Mellon University; Valerie Chen (EMAIL), Carnegie Mellon University; Danish Pruthi (EMAIL), Indian Institute of Science, Bangalore; Nihar B. Shah (EMAIL), Carnegie Mellon University; Ameet Talwalkar (EMAIL), Carnegie Mellon University
Pseudocode | No | The paper describes the methods textually in Section 3.2 (Tested Methods) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code used for the study is available at https://github.com/wnstlr/document-matching.
Open Datasets | Yes | Here, the query and candidate documents are each sampled from human-written summaries and news articles in the CNN/Daily Mail dataset (Hermann et al., 2015; See et al., 2017), a common NLP dataset used for summarization tasks.
Dataset Splits | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running the experiments.
Software Dependencies | Yes | We use the native implementation of the method (https://github.com/slundberg/shap), version 0.40.0... We use the native implementation of the method (https://github.com/nlpyang/PreSumm)... We use the NLTK (https://www.nltk.org/) library to first tokenize the candidate articles... Then we use the Python package rouge-score (https://pypi.org/project/rouge-score/)... Then we use the sentence-transformers (https://www.sbert.net/index.html) library, version 2.2.2 (model used: all-MiniLM-L6-v2), to obtain sentence embeddings...
Experiment Setup | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order. Participants complete all questions in one sitting. For each question, participants see a query summary followed by three longer candidate articles... We limit participants to 3 minutes to answer each question... We offer bonus payments to encourage high-quality responses in terms of both accuracy and time (more details in Appendix D.4). We recruit 275 participants from a balanced pool of adult males and females located in the U.S. with minimum approval ratings of 90% on Prolific (www.prolific.co)...
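The question-set composition described above (4 easy plus 12 hard questions, shuffled) could be assembled as in the following sketch. The function name and question pools are hypothetical; only the 4/12 split and random ordering come from the paper.

```python
import random

def assemble_question_set(easy_pool, hard_pool, seed=None):
    """Draw 4 easy and 12 hard questions and present them in random order,
    mirroring the 16-question setup described above (names are illustrative)."""
    rng = random.Random(seed)
    questions = rng.sample(easy_pool, 4) + rng.sample(hard_pool, 12)
    rng.shuffle(questions)  # interleave easy and hard questions at random
    return questions

# Hypothetical pools of question identifiers.
easy = [f"easy-{i}" for i in range(6)]
hard = [f"hard-{i}" for i in range(20)]
question_set = assemble_question_set(easy, hard, seed=0)
```

Seeding per participant (here `seed=0`) would make each participant's ordering reproducible; whether the study did so is not stated.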