MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models

Authors: Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present MIRAGE, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs' capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs' inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that the LLM is a poor rule-based reasoner.
Researcher Affiliation | Academia | Jiachun Li1,2, Pengfei Cao1,2, Zhuoran Jin1,2, Yubo Chen1,2, Kang Liu1,2, Jun Zhao1,2 — 1School of Artificial Intelligence, University of Chinese Academy of Sciences; 2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Pseudocode | No | The paper describes methods and processes like rule generation and fact generation in prose and mathematical equations (e.g., Equation 2), but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Our code is available at: https://github.com/BugMakerzzz/mirage.
Open Datasets | Yes | In this paper, we present MIRAGE (Meta Inductive ReAsoninG Evaluation), a dataset designed to address the two aforementioned limitations. ... Using the automatically synthesized rules, we can generate facts arbitrarily through instantiation, ensuring the flexibility and scalability of the test data. ... Our code is available at: https://github.com/BugMakerzzz/mirage.
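The "facts through instantiation" idea quoted above can be sketched in a few lines. This is a minimal illustration, not MIRAGE's actual rule specification: here the hidden rule is simply an elementwise offset vector, and the function names are ours.

```python
import random

def make_rule(dim):
    # Toy hidden rule: a fixed offset vector w applied elementwise.
    # Illustrative only -- MIRAGE's automatically synthesized rule space is richer.
    return [random.randint(1, 5) for _ in range(dim)]

def instantiate_facts(rule, n_facts):
    # Generate (input, output) fact pairs by applying the hidden rule
    # to freshly sampled inputs, i.e. "facts through instantiation".
    facts = []
    for _ in range(n_facts):
        x = [random.randint(0, 9) for _ in rule]
        y = [xi + wi for xi, wi in zip(x, rule)]
        facts.append((x, y))
    return facts

rule = make_rule(dim=5)                     # D = 5, one of the tested dimensions
facts = instantiate_facts(rule, n_facts=5)  # N = 5 observed facts per question
```

Because the rule is synthesized first and facts are derived from it, test data of any dimension D or size N can be produced on demand, which is what gives the benchmark its claimed flexibility and scalability.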
Dataset Splits | No | The paper mentions generating questions for each test and sampling specific numbers of questions (e.g., "We sample 500 questions for each test.", "We randomly choose 100 pieces of test data from the dataset"), and, for fine-tuning, training on "8,000 samples". However, it does not explicitly provide percentages or absolute counts for traditional training, validation, and test splits of a fixed dataset, nor does it refer to standard, predefined splits for its custom-generated dataset.
Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper lists specific models used (e.g., "Llama-2-13b-chat-hf, Meta-Llama-3-8B-Instruct, gpt-4-0613, gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620"), but does not specify ancillary software dependencies like Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | For the first three models, given their strong instruction-following capabilities, we provide only the instruction and allow them to answer the questions in a zero-shot setting. For the latter two models, to improve the format accuracy of the response, we additionally provide five examples before they answer the questions. Unless otherwise specified, we continue to use this setup to prompt the model in the subsequent experiments. For the dataset setting, we fix the size N at 5 and measure performance across four scenarios when the dimension D = 3, 5, 8. We sample 500 questions for each test. ... For the training parameters, we set the learning rate to 0.0001, the batch size to 1, and the number of epochs to 10. Additionally, LoRA is employed to train the model on different types of tasks.
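For reference, the reported fine-tuning hyperparameters can be collected into a single config sketch. The key names below are illustrative, not identifiers from the MIRAGE codebase; the values are the ones stated in the paper.

```python
# Hypothetical summary of the reported fine-tuning setup.
# Keys are illustrative; values come from the quoted experiment description.
finetune_config = {
    "method": "LoRA",       # parameter-efficient fine-tuning, as reported
    "learning_rate": 1e-4,  # "we set the learning rate to 0.0001"
    "batch_size": 1,
    "num_epochs": 10,
    "train_samples": 8000,  # fine-tuning on "8,000 samples"
    "hardware": "4x NVIDIA GeForce RTX 3090",
}
```

A batch size of 1 with LoRA adapters is consistent with fitting 8B–13B chat models on 24 GB RTX 3090 cards, which matches the hardware row above.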