Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Authors: Katie Matton, Robert Ness, John Guttag, Emre Kiciman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on two question-answering datasets and three LLMs: GPT-3.5 and GPT-4o from OpenAI (2024) and Claude-3.5-Sonnet from Anthropic (2024). In doing so, we reveal new insights about patterns of LLM unfaithfulness. On a social bias task, we not only identify patterns of unfaithfulness reported in prior work on that dataset (hiding social bias), but also discover a new one (hiding the influence of safety measures). On a medical question answering task, we uncover cases where LLMs provide misleading claims about which pieces of evidence influenced their decisions. |
| Researcher Affiliation | Collaboration | Katie Matton (MIT), Robert Osazuwa Ness (Microsoft Research), John Guttag (MIT), Emre Kıcıman (Microsoft Research) |
| Pseudocode | No | The paper describes a novel method for estimating causal concept faithfulness in Section 3 and details the Bayesian hierarchical models used in Appendix C.2 and C.3. However, these methods are described in narrative and mathematical text, not as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/kmatton/walk-the-talk. |
| Open Datasets | Yes | The task consists of questions adapted from the Bias Benchmark for QA (BBQ) (Parrish et al., 2022), a dataset developed to test for social biases in language models. We use the MedQA benchmark (Jin et al., 2021), which consists of medical licensing exam questions. |
| Dataset Splits | Yes | Due to cost constraints, we sub-sample 30 questions stratified across nine social bias categories (e.g., race, gender, etc.). We collect 50 LLM responses per question (S = 50) using a few-shot, chain-of-thought prompt. We focus on Type 2 questions and randomly sample 30 for our analysis. |
| Hardware Specification | No | The paper mentions evaluating models like GPT-3.5, GPT-4o, Claude-3.5-Sonnet, and Llama-3.1-8B, which are cloud-based services or pre-trained models. It does not specify the hardware (e.g., GPUs, CPUs) used by the authors to run their experiments or interact with these models. |
| Software Dependencies | No | The paper extensively mentions various LLMs such as GPT-3.5, GPT-4o, Claude-3.5-Sonnet, and Llama-3.1-8B as the subjects of evaluation. It also refers to a Bayesian hierarchical model but does not specify any particular software libraries or their versions (e.g., PyTorch, TensorFlow, or specific statistical packages) used for its implementation or for other parts of the methodology. |
| Experiment Setup | Yes | We use GPT-4o as the auxiliary LLM to assist with counterfactual question creation... We collect 50 LLM responses per question (S = 50) using a few-shot, chain-of-thought prompt. For the auxiliary LLM, we use a temperature of 0 to make the outputs close to deterministic. In all experiments, for all of the LLMs that we analyzed, we use a temperature of 0.7. For the GPT models, we set the max tokens to 256. For Claude-3.5-Sonnet, we found that with a token limit of 256, responses were often cut off mid-sentence, so we set its max tokens to 512. |
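The dataset-splits row above describes sub-sampling 30 questions stratified across nine social bias categories. A minimal sketch of such stratified sub-sampling is below; the `stratified_sample` helper and the `(question_id, category)` data layout are illustrative assumptions, not the authors' actual code.

```python
import random
from collections import defaultdict

def stratified_sample(questions, n_total, seed=0):
    """Sample n_total question IDs, stratified evenly across categories.

    `questions` is a list of (question_id, category) pairs, e.g. with the
    nine BBQ bias categories (race, gender, etc.) as the strata.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for qid, cat in questions:
        by_cat[cat].append(qid)

    per_cat = n_total // len(by_cat)  # e.g., 30 // 9 = 3 per category
    sampled = []
    for cat in sorted(by_cat):
        sampled.extend(rng.sample(by_cat[cat], min(per_cat, len(by_cat[cat]))))

    # Top up with random extras when n_total isn't divisible by the
    # number of strata (30 across 9 categories leaves a remainder of 3).
    remaining = [q for qs in by_cat.values() for q in qs if q not in sampled]
    extra = n_total - len(sampled)
    if extra > 0:
        sampled.extend(rng.sample(remaining, extra))
    return sampled
```

With nine categories and a budget of 30, this draws 3 questions per category and fills the remaining 3 slots at random, so every category is represented.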
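The per-model decoding settings quoted in the experiment-setup row (temperature 0.7; max tokens 256 for the GPT models and 512 for Claude-3.5-Sonnet; S = 50 responses per question) can be collected into a small configuration sketch. The dict layout and the `query_llm` callable are hypothetical placeholders, not the authors' code or any specific provider API.

```python
# Decoding settings per analyzed model, as reported in the paper.
DECODING = {
    "gpt-3.5":           {"temperature": 0.7, "max_tokens": 256},
    "gpt-4o":            {"temperature": 0.7, "max_tokens": 256},
    # 256 tokens often truncated Claude's responses mid-sentence, so 512 is used.
    "claude-3.5-sonnet": {"temperature": 0.7, "max_tokens": 512},
}
S = 50  # number of responses collected per question

def collect_responses(model, question, query_llm):
    """Sample S responses for one question.

    query_llm(model, prompt, **decoding_kwargs) is a hypothetical helper
    standing in for whatever client the provider APIs require.
    """
    cfg = DECODING[model]
    return [query_llm(model, question, **cfg) for _ in range(S)]
```

Keeping the settings in one table makes the Claude-specific token-limit exception explicit rather than burying it in per-call arguments.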