Reasoning Elicitation in Language Models via Counterfactual Feedback

Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities." (...) 5 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González; Harvard University, Microsoft Research Cambridge, Cornell Tech |
| Pseudocode | Yes | Algorithm 1: Supervised Counterfactual Feedback (...) Algorithm 2: Preference-based Counterfactual Feedback (...) Algorithm 3: Preference-based Causal Consistency Feedback |
| Open Source Code | No | No explicit statement regarding the release of their own source code for the methodology is provided. The paper references third-party models like Phi-3 mini from Hugging Face but does not offer a link to their specific implementation. |
| Open Datasets | Yes | "We present three real-world causal reasoning problems: in the Healthcare domain, we examine breast cancer treatment and develop a simplified problem that determines how different treatment options, namely radiotherapy/chemotherapy and surgery, are assigned to patients based on cancer type, tumor size, and nodal involvement. This model is grounded in a real-world guideline (MD Anderson Cancer Center) and published statistics on the disease (Orrantia Borunda et al., 2022; Sezgın et al., 2020; Carey et al., 2006). In the Engineering domain, we implement an automatic fault detection algorithm for transmission lines (Reddy et al., 2016). (...) In the Math Benchmarking domain, we select a math question from GSM8K (Cobbe et al., 2021), a widely used benchmark for evaluating language models on grade school math problems." |
| Dataset Splits | No | The paper describes generating datasets for fine-tuning and evaluation scenarios (in-domain and generalization modes) by sampling "100 contexts per causal relationship" and generating "10 answers for each question per context". It mentions a "held-out set of test samples" and distinct training and evaluation phases. However, it does not provide specific numerical splits in the traditional sense (e.g., percentages or exact counts for train/validation/test from a fixed dataset); instead, it describes which *types* of causal relationships are used for training versus testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using specific language models like "Phi-3 mini" and "Llama 3 8B" and refers to Hugging Face models. It does not, however, list software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) that would be needed for replication. |
| Experiment Setup | No | The paper describes the fine-tuning methods (SFT, DPO, DPO+CCF) and data generation strategies, but it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used for the fine-tuning process. |
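Since the paper releases no code, the data-generation scheme it describes (100 contexts per causal relationship, 10 sampled answers per question per context, with whole relationship *types* held out for testing) would have to be reimplemented from the text. The sketch below shows one plausible reading of that scheme; every function and variable name here is a hypothetical stand-in, not the authors' implementation.

```python
def build_dataset(relationships, generate_context, generate_answer,
                  n_contexts=100, n_answers=10):
    """Sample contexts and candidate answers per causal relationship.

    `relationships`, `generate_context`, and `generate_answer` are
    hypothetical stand-ins for the paper's (unreleased) generators.
    Defaults mirror the counts quoted in the Dataset Splits row.
    """
    dataset = []
    for rel in relationships:
        for _ in range(n_contexts):
            ctx = generate_context(rel)
            answers = [generate_answer(rel, ctx) for _ in range(n_answers)]
            dataset.append({"relationship": rel,
                            "context": ctx,
                            "answers": answers})
    return dataset


def split_by_relationship(dataset, held_out_types):
    """Split by relationship type rather than by row, mirroring the
    train-on-some-relationships, test-on-held-out-relationships setup
    the paper describes instead of a conventional percentage split."""
    train = [d for d in dataset if d["relationship"] not in held_out_types]
    test = [d for d in dataset if d["relationship"] in held_out_types]
    return train, test
```

Under this reading, the "split" is a partition over causal relationships, so every context and answer for a held-out relationship lands entirely in the test set, which is what makes the generalization evaluation meaningful.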