Assisting Human Decisions in Document Matching

Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B. Shah, Ameet Talwalkar

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a crowdsourced study (N = 271 participants), we find that providing black-box model explanations reduces users' accuracy on the matching task, contrary to the commonly held belief that they can be helpful by allowing a better understanding of the model.
Researcher Affiliation | Academia | Joon Sik Kim (EMAIL), Carnegie Mellon University; Valerie Chen (EMAIL), Carnegie Mellon University; Danish Pruthi (EMAIL), Indian Institute of Science, Bangalore; Nihar B. Shah (EMAIL), Carnegie Mellon University; Ameet Talwalkar (EMAIL), Carnegie Mellon University
Pseudocode | No | The paper describes the methods textually in Section 3.2 (Tested Methods) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code used for the study is available at https://github.com/wnstlr/document-matching.
Open Datasets | Yes | Here, the query and candidate documents are each sampled from human-written summaries and news articles in the CNN/Daily Mail dataset (Hermann et al., 2015; See et al., 2017), a common NLP dataset used for summarization tasks.
Dataset Splits | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running the experiments.
Software Dependencies | Yes | We use the native implementation of the method (https://github.com/slundberg/shap), version 0.40.0... We use the native implementation of the method (https://github.com/nlpyang/PreSumm)... We use the NLTK (https://www.nltk.org/) library to first tokenize the candidate articles... Then we use the Python package rouge-score (https://pypi.org/project/rouge-score/)... Then we use the sentence-transformers (https://www.sbert.net/index.html) library, version 2.2.2 (model used: all-MiniLM-L6-v2), to obtain sentence embeddings...
Experiment Setup | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order. Participants complete all questions in one sitting. For each question, participants see a query summary followed by three longer candidate articles... We limit participants to 3 minutes to answer each question... We offer bonus payments to encourage high-quality responses in terms of both accuracy and time (more details in Appendix D.4). We recruit 275 participants from a balanced pool of adult males and females located in the U.S. with minimum approval ratings of 90% on Prolific (www.prolific.co)...
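The question-set composition described above (4 easy plus 12 hard questions, shuffled) could be assembled as in the following sketch. The function name and question pools are hypothetical; only the 4/12 split and random ordering come from the paper.

```python
import random

def assemble_question_set(easy_pool, hard_pool, seed=None):
    """Draw 4 easy and 12 hard questions and present them in random order,
    mirroring the 16-question setup described above (names are illustrative)."""
    rng = random.Random(seed)
    questions = rng.sample(easy_pool, 4) + rng.sample(hard_pool, 12)
    rng.shuffle(questions)  # interleave easy and hard questions at random
    return questions

# Hypothetical pools of question identifiers.
easy = [f"easy-{i}" for i in range(6)]
hard = [f"hard-{i}" for i in range(20)]
question_set = assemble_question_set(easy, hard, seed=0)
```

Seeding per participant (here `seed=0`) would make each participant's ordering reproducible; whether the study did so is not stated.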