Complementarity: Toward Better Metrics and Optimizing Data Efficiency in LLMs

Authors: Roy Siegelmann

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We first establish a strong correlation between Complementarity and domain-specific task performance. Most interestingly, we demonstrate that the Complementarity taken over a training validation set provides a better predictor of generalization to future test sets than directly measuring performance on a test validation set. With this, we introduce an algorithm that carefully selects the data to fine-tune upon, leading to a high-performing fine-tuned generalist model while using only a fraction of the data, and without requiring data from the test domain. [...] To study the predictive capability of Complementarity as a metric, we look for a correlation between Complementarity and performance across four domains: Coding, Mathematics, Medicine, and Physics. [...] Table 1 contains the results of this experiment. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above.
Researcher Affiliation Academia Roy N. Siegelmann, EMAIL, Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218
Pseudocode Yes Our proof-of-concept algorithm is as follows: Given a base model M and p domains, split off training data R_1, ..., R_p and validation sets V_1, ..., V_p. 1. Create N splits of the combined R_1, ..., R_p dataset. Fine-tune on each split to receive models M_i. 2. Select the k M_i's with the highest averaged Complementarity across all V_1, ..., V_p. 3. Combine the associated split datasets and fine-tune M, receiving M_avg as the choice model.
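The three steps above can be sketched in Python. Since the paper's code is not released, `finetune` and `avg_complementarity` below are hypothetical stand-ins for the actual training and metric implementations:

```python
import random

def select_and_finetune(base_model, domain_data, val_sets, N=10, k=2,
                        finetune=None, avg_complementarity=None):
    """Sketch of the proof-of-concept data-selection loop.

    `finetune(model, data)` and `avg_complementarity(model, val_sets)`
    are hypothetical stand-ins; the paper's actual training stack and
    Complementarity computation are not released.
    """
    # Step 1: pool all p domains' training data and cut it into N splits,
    # then fine-tune one candidate model per split.
    pooled = [ex for domain in domain_data for ex in domain]
    random.shuffle(pooled)
    splits = [pooled[i::N] for i in range(N)]
    candidates = [(finetune(base_model, s), s) for s in splits]

    # Step 2: keep the k models with the highest Complementarity
    # averaged over all validation sets V_1, ..., V_p.
    candidates.sort(key=lambda ms: avg_complementarity(ms[0], val_sets),
                    reverse=True)
    top_k = candidates[:k]

    # Step 3: merge the k selected splits and fine-tune the base model
    # on the combined data to obtain the final model M_avg.
    combined = [ex for _, split in top_k for ex in split]
    return finetune(base_model, combined)
```

With N = 10 and k = 2 this trains the final model on only 20% of the pooled data, matching the paper's "fraction of the data" claim.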
Open Source Code No The paper does not provide a specific link to source code, nor does it explicitly state that the code for their methodology is open-source or available in supplementary materials. It mentions using existing open-source models (Mistral, LLaMa) but not releasing their own implementation.
Open Datasets Yes The evaluation coding dataset selected was Mostly Basic Python Problems (MBPP)... The evaluation mathematics dataset selected was Grade School Math 8k (GSM8K)... The medicine and physics datasets chosen were from Measuring Massive Multitask Language Understanding (MMLU)... More details about the datasets used for training and evaluation are in Appendix Tables 4 and 5, respectively. (Jiang et al., 2023) (AI@Meta, 2024)
Dataset Splits Yes Given a base model M and p domains, split off training data R_1, ..., R_p and validation sets V_1, ..., V_p. 1. Create N splits of the combined R_1, ..., R_p dataset. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above. [...] The first section consists of the baseline model and the model fine-tuned naively on all the data, i.e. 10,000 entries from each of the four domains. [...] four different models fine-tuned on the best two 1/10ths of the dataset
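The data budget implied by these splits is easy to work out: the naive baseline uses all four domains' training data, while the selected model trains on only the best k = 2 of N = 10 equal splits:

```python
# Naive baseline: 10,000 entries from each of the four domains.
naive_total = 4 * 10_000                  # 40,000 entries

# Selection with k = 2, N = 10: keep the best 2 of 10 equal splits.
selected_total = naive_total * 2 // 10    # 8,000 entries

# k/N = 2/10, so the chosen model trains on 20% of the data.
print(naive_total, selected_total, selected_total / naive_total)
```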
Hardware Specification No The paper mentions that Mistral 7B Instruct v0.2 is "highly efficient under quantization, so it can be fine-tuned in a realistic time-scale" but does not specify the actual hardware (e.g., GPU models, CPU, memory) used for running the experiments or fine-tuning.
Software Dependencies No The paper mentions using Python for the MBPP dataset and RegEx for text scraping, but it does not specify any particular software libraries, frameworks, or their version numbers that were used for their implementation or experiments.
Experiment Setup Yes As such, we settled on Mistral 7B Instruct v0.2 (Jiang et al., 2023) as our main model, and LLaMa 8B Instruct (AI@Meta, 2024) for verification of our results. We created four fine-tuned versions of the base model, each on one of the four different training datasets, and evaluated the performance of each model on all four domain tasks. [...] For this, we create six fine-tuned versions of the base model, each trained on an even split from a pair of domains. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above.
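The "six fine-tuned versions ... each trained on an even split from a pair of domains" correspond to the C(4,2) = 6 unordered pairs of the four domains:

```python
from itertools import combinations

domains = ["Coding", "Mathematics", "Medicine", "Physics"]

# Every unordered pair of the four domains; each pair yields one
# fine-tuned model trained on an even mix of the two domains' data.
pairs = list(combinations(domains, 2))
print(len(pairs))  # 6
for a, b in pairs:
    print(f"fine-tune on 50% {a} + 50% {b}")
```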