Assisting Human Decisions in Document Matching
Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B Shah, Ameet Talwalkar
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a crowdsourced (N = 271 participants) study, we find that providing black-box model explanations reduces users' accuracy on the matching task, contrary to the commonly-held belief that they can be helpful by allowing better understanding of the model. |
| Researcher Affiliation | Academia | Joon Sik Kim (Carnegie Mellon University), Valerie Chen (Carnegie Mellon University), Danish Pruthi (Indian Institute of Science, Bangalore), Nihar B. Shah (Carnegie Mellon University), Ameet Talwalkar (Carnegie Mellon University) |
| Pseudocode | No | The paper describes the methods textually in Section 3.2 (Tested Methods) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used for the study is available at https://github.com/wnstlr/document-matching. |
| Open Datasets | Yes | Here, the query and candidate documents are each sampled from human-written summaries and news articles in the CNN/Daily Mail dataset (Hermann et al., 2015; See et al., 2017), a common NLP dataset used for summarization tasks. |
| Dataset Splits | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running the experiments. |
| Software Dependencies | Yes | We use the native implementation of the method (https://github.com/slundberg/shap), version 0.40.0... We use the native implementation of the method (https://github.com/nlpyang/PreSumm)... We use the NLTK (https://www.nltk.org/) library to first tokenize the candidate articles... Then we use the Python package rouge-score (https://pypi.org/project/rouge-score/)... Then we use the sentence-transformers (https://www.sbert.net/index.html) library version 2.2.2 (model used: all-MiniLM-L6-v2) to obtain sentence embeddings... |
| Experiment Setup | Yes | We present 16 questions to each participant. The 16 questions comprise 4 easy and 12 hard questions in random order. Participants complete all questions in one sitting. For each question, participants see a query summary followed by three longer candidate articles... we limit participants to spend 3 minutes to answer each question... We offer bonus payments to encourage high-quality responses in terms of both accuracy and time (more details in Appendix D.4). We recruit 275 participants from a balanced pool of adult males and females located in the U.S. with minimum approval ratings of 90% on Prolific (www.prolific.co)... |
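The software dependencies quoted in the table can be collected into a single install step. A minimal sketch, assuming a standard pip environment: versions are pinned only where the paper states them (shap 0.40.0, sentence-transformers 2.2.2); nltk and rouge-score are left unpinned because the paper does not give their versions.

```shell
# Pin the versions the paper reports; leave the rest unpinned.
pip install shap==0.40.0 sentence-transformers==2.2.2 nltk rouge-score

# NLTK's sentence/word tokenizers additionally require the punkt data package.
python -c "import nltk; nltk.download('punkt')"
```

The PreSumm baseline (https://github.com/nlpyang/PreSumm) is distributed as a research repository rather than a pip package, so it would need to be cloned and set up separately per its own README.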