Reasoning Elicitation in Language Models via Counterfactual Feedback
Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya Nori, Javier Hernandez
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities. (...) 5 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González. Affiliations: Harvard University, Microsoft Research Cambridge, Cornell Tech |
| Pseudocode | Yes | Algorithm 1 Supervised Counterfactual Feedback (...) Algorithm 2 Preference-based Counterfactual Feedback (...) Algorithm 3 Preference-based Causal Consistency Feedback |
| Open Source Code | No | No explicit statement regarding the release of their own source code for the methodology is provided. The paper references third-party models like Phi-3 mini from Hugging Face but does not offer a link to their specific implementation. |
| Open Datasets | Yes | We present three real-world causal reasoning problems: in the Healthcare domain, we examine breast cancer treatment and develop a simplified problem that determines how different treatment options, namely radiotherapy/chemotherapy and surgery, are assigned to patients based on cancer type, tumor size, and nodal involvement. This model is grounded in a real-world guideline (MD Anderson Cancer Center) and published statistics on the disease (Orrantia Borunda et al., 2022; Sezgin et al., 2020; Carey et al., 2006). In the Engineering domain, we implement an automatic fault detection algorithm for transmission lines (Reddy et al., 2016). (...) In the Math Benchmarking domain, we select a math question from GSM8K (Cobbe et al., 2021), a widely used benchmark for evaluating language models on grade school math problems. |
| Dataset Splits | No | The paper describes generating datasets for fine-tuning and evaluation scenarios (in-domain, generalization modes) by sampling '100 contexts per causal relationship' and generating '10 answers for each question per context'. It mentions a 'held-out set of test samples' and distinct training and evaluation phases. However, it does not provide numerical splits in the traditional sense (e.g., percentages or exact counts for train/validation/test from a fixed dataset); instead, it describes which *types* of causal relationships are used for training versus testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using specific language models like 'Phi-3 mini' and 'Llama 3 8B' and refers to Hugging Face models. It does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed for replication. |
| Experiment Setup | No | The paper describes the fine-tuning methods (SFT, DPO, DPO+CCF) and data generation strategies, but it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used for the fine-tuning process. |
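The Dataset Splits row quotes the paper's sampling description ('100 contexts per causal relationship', '10 answers for each question per context') without exact sample counts. The following is a minimal sketch, not the authors' implementation, of the sample accounting that description implies; the function name, the `questions_per_context=2` value (one factual plus one counterfactual question, inferred from the paper's framing), and the example relationship names are all assumptions.

```python
import random


def build_samples(causal_relationships, contexts_per_rel=100,
                  questions_per_context=2, answers_per_question=10,
                  seed=0):
    """Enumerate the samples implied by the paper's sampling scheme.

    For each causal relationship, sample `contexts_per_rel` contexts;
    for each context, pose `questions_per_context` questions (assumed
    here to be one factual and one counterfactual); for each question,
    generate `answers_per_question` answers.
    """
    rng = random.Random(seed)
    samples = []
    for rel in causal_relationships:
        for c in range(contexts_per_rel):
            context_id = f"{rel}-ctx{c:03d}"
            for q in range(questions_per_context):
                for a in range(answers_per_question):
                    samples.append((rel, context_id, q, a))
    rng.shuffle(samples)  # order samples randomly, as for fine-tuning
    return samples


# Three domains are described in the paper (healthcare, engineering, math):
data = build_samples(["healthcare", "engineering", "math"])
print(len(data))  # 3 relationships x 100 contexts x 2 questions x 10 answers = 6000
```

Under these assumed defaults, the scheme yields 6,000 samples; the actual counts depend on the unreported number of questions per context, which is exactly the kind of detail the 'No' verdict above flags as missing.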