Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Authors: Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks. Project website at https://llm-multiagent-ft.github.io." (Section 3, Experiments; Section 3.3, Quantitative Results) "We compare baselines and our method... in Table 1. The accuracy and standard error for each dataset are reported."
Researcher Affiliation | Academia | Vighnesh Subramaniam (MIT CSAIL), Yilun Du (Harvard University), Joshua B. Tenenbaum (MIT CSAIL, BCS, CBMM), Antonio Torralba (MIT CSAIL), Shuang Li (Stanford University), Igor Mordatch (UC Berkeley)
Pseudocode | Yes | "The algorithm for the proposed approach of L iterations of finetuning is detailed in Algorithm 1. The steps for collecting data for finetuning the generation models are marked in red, and the finetuning of critic models is shown in blue. ... We provide pseudocode in Algorithm 2."
Open Source Code | No | "Project website at https://llm-multiagent-ft.github.io." The paper mentions a project website, which typically offers an overview, but does not explicitly state that the source code for the described methodology is hosted there or provide a direct link to a code repository.
Open Datasets | Yes | "Grade School Math (GSM) (Cobbe et al., 2021) consists of math word problems... MATH (Hendrycks et al., 2021) consists of competition-level math problems... MMLU Evaluation: We introduce an additional evaluation with the MMLU benchmark, finetuning on 500 MMLU examples and testing on 500 different MMLU examples."
Dataset Splits | Yes | "For each dataset, we randomly select 500 examples for finetuning the language model. Additionally, we select 500 held-out problems for evaluation. All results are reported over 500 fixed evaluation problems, except GSM results for GPT-3.5, which are reported over 1000 fixed evaluation problems (to construct non-overlapping confidence bars)."
Hardware Specification | Yes | "For all open-source models, we perform finetuning using a total of eight 40GB A100 GPUs and four 80GB H100 GPUs. The evaluation of individual inference times for multi-agent finetuning with open-source models took approximately 30 to 36 hours. Phi-3 ... on two 40GB A100 GPUs or one 80GB H100 GPU... Mistral ... on four 40GB A100 GPUs or two 80GB H100 GPUs... LLaMA-3 ... on three 80GB H100 GPUs"
Software Dependencies | No | The paper names the base language models used (Phi-3 4B, Mistral 7B, LLaMA-3 8B, GPT-3.5) but does not specify versions for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, CUDA, Python) used in the implementation.
Experiment Setup | Yes | "Phi-3 ... run a total of two epochs of finetuning for generation agents and one epoch of finetuning for critic agents. We use a batch size of 1 and a learning rate of 5e-6 for generation agents and 5e-7 for critic agents. When applying multiple iterations of finetuning, we use a learning rate of 5e-7, and a weight decay of 1e-3 across both generation and critic agents."
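The L-iteration procedure summarized in the Pseudocode row (collect multiagent data, then finetune separate generation and critic copies of the base model) can be sketched as below. This is a minimal illustration under stated assumptions: `generate`, `finetune`, and `majority_vote` are hypothetical stand-ins, not the authors' Algorithm 1.

```python
from collections import Counter

def majority_vote(answers):
    # Most common answer among agent drafts serves as the consensus.
    return Counter(answers).most_common(1)[0][0]

def multiagent_finetune(generate, finetune, problems, n_agents=3, n_iters=2):
    """Sketch of L iterations: each generation agent drafts an answer per
    problem, a consensus is formed, and each agent copy is finetuned on
    its own collected data (generation data vs. critic-style revisions)."""
    generators = [generate] * n_agents
    critics = [generate] * n_agents
    for _ in range(n_iters):
        gen_data, critic_data = [], []
        for p in problems:
            drafts = [g(p, i) for i, g in enumerate(generators)]
            consensus = majority_vote(drafts)
            # Generation agents keep their consensus-matching drafts;
            # critic agents see the full draft set plus the consensus.
            gen_data.append([d for d in drafts if d == consensus])
            critic_data.append((drafts, consensus))
        generators = [finetune(g, gen_data) for g in generators]
        critics = [finetune(c, critic_data) for c in critics]
    return generators, critics
```

The key design point reflected here is that each of the 2N agent copies is finetuned on its *own* trajectories, which is what preserves diversity across iterations.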
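The split protocol in the Dataset Splits row (500 randomly selected finetuning examples and 500 disjoint held-out evaluation problems) amounts to the following sketch; the function name and fixed seed are assumptions for illustration.

```python
import random

def make_split(dataset, n_train=500, n_eval=500, seed=0):
    # Shuffle once, then take disjoint slices for finetuning and evaluation.
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    held_out = shuffled[n_train:n_train + n_eval]
    return train, held_out
```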
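The Phi-3 hyperparameters quoted in the Experiment Setup row can be collected into a plain config for reference; the dataclass and field names are illustrative, not from the paper, and the epoch count for later iterations is an assumption since the row does not state it.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    epochs: int
    learning_rate: float
    batch_size: int = 1        # batch size of 1 throughout
    weight_decay: float = 0.0

# First-iteration settings quoted for Phi-3:
GENERATION_CFG = FinetuneConfig(epochs=2, learning_rate=5e-6)
CRITIC_CFG = FinetuneConfig(epochs=1, learning_rate=5e-7)

# Later iterations use lr 5e-7 and weight decay 1e-3 for both agent roles
# (epochs=1 here is an assumption; the row does not specify it).
LATER_ITER_CFG = FinetuneConfig(epochs=1, learning_rate=5e-7, weight_decay=1e-3)
```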