MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models

Authors: Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present MIRAGE, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs' capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs' inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that the LLM is a poor rule-based reasoner.
Researcher Affiliation | Academia | Jiachun Li1,2, Pengfei Cao1,2, Zhuoran Jin1,2, Yubo Chen1,2, Kang Liu1,2, Jun Zhao1,2 — 1School of Artificial Intelligence, University of Chinese Academy of Sciences; 2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Pseudocode | No | The paper describes methods and processes like rule generation and fact generation in prose and mathematical equations (e.g., Equation 2), but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Our code is available at: https://github.com/BugMakerzzz/mirage.
Open Datasets | Yes | In this paper, we present MIRAGE (Meta Inductive ReAsoninG Evaluation), a dataset designed to address the two aforementioned limitations. ... Using the automatically synthesized rules, we can generate facts arbitrarily through instantiation, ensuring the flexibility and scalability of the test data. ... Our code is available at: https://github.com/BugMakerzzz/mirage.
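The "facts through instantiation" idea quoted above can be sketched in a few lines. This is a minimal illustration, not MIRAGE's actual rule specification: here the hidden rule is simply an elementwise offset vector, and the function names are ours.

```python
import random

def make_rule(dim):
    # Toy hidden rule: a fixed offset vector w applied elementwise.
    # Illustrative only -- MIRAGE's automatically synthesized rule space is richer.
    return [random.randint(1, 5) for _ in range(dim)]

def instantiate_facts(rule, n_facts):
    # Generate (input, output) fact pairs by applying the hidden rule
    # to freshly sampled inputs, i.e. "facts through instantiation".
    facts = []
    for _ in range(n_facts):
        x = [random.randint(0, 9) for _ in rule]
        y = [xi + wi for xi, wi in zip(x, rule)]
        facts.append((x, y))
    return facts

rule = make_rule(dim=5)                     # D = 5, one of the tested dimensions
facts = instantiate_facts(rule, n_facts=5)  # N = 5 observed facts per question
```

Because the rule is synthesized first and facts are derived from it, test data of any dimension D or size N can be produced on demand, which is what gives the benchmark its claimed flexibility and scalability.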
Dataset Splits | No | The paper mentions generating questions for each test and sampling specific numbers of questions (e.g., "We sample 500 questions for each test.", "We randomly choose 100 pieces of test data from the dataset"), and, for fine-tuning, training on "8,000 samples". However, it does not explicitly provide percentages or absolute counts for traditional training, validation, and test splits of a fixed dataset, nor does it refer to standard, predefined splits for its custom-generated dataset.
Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper lists specific models used (e.g., "Llama-2-13b-chat-hf, Meta-Llama-3-8B-Instruct, gpt-4-0613, gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620"), but does not specify ancillary software dependencies like Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | For the first three models, given their strong instruction-following capabilities, we provide only the instruction and allow them to answer the questions in a zero-shot setting. For the latter two models, to improve the format accuracy of the response, we additionally provide five examples before they answer the questions. Unless otherwise specified, we continue to use this setup to prompt the model in the subsequent experiments. For the dataset setting, we fix the size N at 5 and measure performance across four scenarios when the dimension D = 3, 5, 8. We sample 500 questions for each test. ... For the training parameters, we set the learning rate to 0.0001, the batch size to 1, and the number of epochs to 10. Additionally, LoRA is employed to train the model on different types of tasks.
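For reference, the reported fine-tuning hyperparameters can be collected into a single config sketch. The key names below are illustrative, not identifiers from the MIRAGE codebase; the values are the ones stated in the paper.

```python
# Hypothetical summary of the reported fine-tuning setup.
# Keys are illustrative; values come from the quoted experiment description.
finetune_config = {
    "method": "LoRA",       # parameter-efficient fine-tuning, as reported
    "learning_rate": 1e-4,  # "we set the learning rate to 0.0001"
    "batch_size": 1,
    "num_epochs": 10,
    "train_samples": 8000,  # fine-tuning on "8,000 samples"
    "hardware": "4x NVIDIA GeForce RTX 3090",
}
```

A batch size of 1 with LoRA adapters is consistent with fitting 8B–13B chat models on 24 GB RTX 3090 cards, which matches the hardware row above.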