Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Authors: Katie Matton, Robert Ness, John Guttag, Emre Kiciman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on two question-answering datasets and three LLMs: GPT-3.5 and GPT-4o from OpenAI (2024) and Claude-3.5-Sonnet from Anthropic (2024). In doing so, we reveal new insights about patterns of LLM unfaithfulness. On a social bias task, we not only identify patterns of unfaithfulness reported in prior work on that dataset (hiding social bias), but also discover a new one (hiding the influence of safety measures). On a medical question answering task, we uncover cases where LLMs provide misleading claims about which pieces of evidence influenced their decisions. |
| Researcher Affiliation | Collaboration | Katie Matton (MIT), Robert Osazuwa Ness (Microsoft Research), John Guttag (MIT), Emre Kıcıman (Microsoft Research) |
| Pseudocode | No | The paper describes a novel method for estimating causal concept faithfulness in Section 3 and details the Bayesian hierarchical models used in Appendix C.2 and C.3. However, these methods are described in narrative and mathematical text, not as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/kmatton/walk-the-talk. |
| Open Datasets | Yes | The task consists of questions adapted from the Bias Benchmark for QA (BBQ) (Parrish et al., 2022), a dataset developed to test for social biases in language models. We use the MedQA benchmark (Jin et al., 2021), which consists of medical licensing exam questions. |
| Dataset Splits | Yes | Due to cost constraints, we sub-sample 30 questions stratified across nine social bias categories (e.g., race, gender, etc.). We collect 50 LLM responses per question (S = 50) using a few-shot, chain-of-thought prompt. We focus on Type 2 questions and randomly sample 30 for our analysis. |
| Hardware Specification | No | The paper mentions evaluating models like GPT-3.5, GPT-4o, Claude-3.5-Sonnet, and Llama-3.1-8B, which are cloud-based services or pre-trained models. It does not specify the hardware (e.g., GPUs, CPUs) used by the authors to run their experiments or interact with these models. |
| Software Dependencies | No | The paper extensively mentions various LLMs such as GPT-3.5, GPT-4o, Claude-3.5-Sonnet, and Llama-3.1-8B as the subjects of evaluation. It also refers to a Bayesian hierarchical model but does not specify any particular software libraries or their versions (e.g., PyTorch, TensorFlow, or specific statistical packages) used for its implementation or for other parts of the methodology. |
| Experiment Setup | Yes | We use GPT-4o as the auxiliary LLM to assist with counterfactual question creation... We collect 50 LLM responses per question (S = 50) using a few-shot, chain-of-thought prompt. For the auxiliary LLM, we use a temperature of 0 to make the outputs close to deterministic. In all experiments, for all of the LLMs that we analyzed, we use a temperature of 0.7. For the GPT models, we set the max tokens to 256. For Claude-3.5-Sonnet, we found that with a token limit of 256, responses were often cut off mid-sentence, so we set its max tokens to 512. |
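The dataset-splits row above describes sub-sampling 30 questions stratified across nine social bias categories. A minimal sketch of such stratified sub-sampling is below; the `stratified_sample` helper and the `(question_id, category)` data layout are illustrative assumptions, not the authors' actual code.

```python
import random
from collections import defaultdict

def stratified_sample(questions, n_total, seed=0):
    """Sample n_total question IDs, stratified evenly across categories.

    `questions` is a list of (question_id, category) pairs, e.g. with the
    nine BBQ bias categories (race, gender, etc.) as the strata.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for qid, cat in questions:
        by_cat[cat].append(qid)

    per_cat = n_total // len(by_cat)  # e.g., 30 // 9 = 3 per category
    sampled = []
    for cat in sorted(by_cat):
        sampled.extend(rng.sample(by_cat[cat], min(per_cat, len(by_cat[cat]))))

    # Top up with random extras when n_total isn't divisible by the
    # number of strata (30 across 9 categories leaves a remainder of 3).
    remaining = [q for qs in by_cat.values() for q in qs if q not in sampled]
    extra = n_total - len(sampled)
    if extra > 0:
        sampled.extend(rng.sample(remaining, extra))
    return sampled
```

With nine categories and a budget of 30, this draws 3 questions per category and fills the remaining 3 slots at random, so every category is represented.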
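The per-model decoding settings quoted in the experiment-setup row (temperature 0.7; max tokens 256 for the GPT models and 512 for Claude-3.5-Sonnet; S = 50 responses per question) can be collected into a small configuration sketch. The dict layout and the `query_llm` callable are hypothetical placeholders, not the authors' code or any specific provider API.

```python
# Decoding settings per analyzed model, as reported in the paper.
DECODING = {
    "gpt-3.5":           {"temperature": 0.7, "max_tokens": 256},
    "gpt-4o":            {"temperature": 0.7, "max_tokens": 256},
    # 256 tokens often truncated Claude's responses mid-sentence, so 512 is used.
    "claude-3.5-sonnet": {"temperature": 0.7, "max_tokens": 512},
}
S = 50  # number of responses collected per question

def collect_responses(model, question, query_llm):
    """Sample S responses for one question.

    query_llm(model, prompt, **decoding_kwargs) is a hypothetical helper
    standing in for whatever client the provider APIs require.
    """
    cfg = DECODING[model]
    return [query_llm(model, question, **cfg) for _ in range(S)]
```

Keeping the settings in one table makes the Claude-specific token-limit exception explicit rather than burying it in per-call arguments.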