Complementarity: Toward Better Metrics and Optimizing Data Efficiency in LLMs
Authors: Roy Siegelmann
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first establish a strong correlation between Complementarity and domain-specific task performance. Most interestingly, we demonstrate that the Complementarity taken over a training validation set provides a better predictor of generalization to future test sets than directly measuring performance on a test validation set. With this, we introduce an algorithm that carefully selects the data to fine-tune upon, leading to a high-performing fine-tuned generalist model while using only a fraction of the data, and without requiring data from the test domain. [...] To study the predictive capability of Complementarity as a metric, we look for a correlation between Complementarity and performance across four domains: Coding, Mathematics, Medicine, and Physics. [...] Table 1 contains the results of this experiment. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above. |
| Researcher Affiliation | Academia | Roy N. Siegelmann EMAIL Department of Applied Mathematics and Statistics Johns Hopkins University, Baltimore, MD 21218 |
| Pseudocode | Yes | Our proof-of-concept algorithm is as follows: Given a base model M and p domains, split off training data R1, ..., Rp and validation sets V1, ..., Vp. 1. Create N splits R1, ..., RN of the combined R1, ..., Rp dataset. Fine-tune on each to receive M1, ..., MN. 2. Select the k Mi's with the highest averaged Complementarity across all V1, ..., Vp. 3. Combine the associated split datasets and fine-tune M on them, receiving Mavg as the choice model. |
| Open Source Code | No | The paper does not provide a specific link to source code, nor does it explicitly state that the code for their methodology is open-source or available in supplementary materials. It mentions using existing open-source models (Mistral, LLaMa) but not releasing their own implementation. |
| Open Datasets | Yes | The evaluation coding dataset selected was Mostly Basic Python Problems (MBPP)... The evaluation mathematics dataset selected was Grade School Math 8k (GSM8K)... The medicine and physics datasets chosen were from Measuring Massive Multitask Language Understanding (MMLU)... More details about the datasets used for training and evaluation are in Appendix Tables 4 and 5, respectively. (Jiang et al., 2023) (AI@Meta, 2024) |
| Dataset Splits | Yes | Given a base model M and p domains, split off training data R1, ..., Rp and validation sets V1, ..., Vp. 1. Create N splits R1, ..., RN of the combined R1, ..., Rp dataset. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above. [...] The first section consists of the baseline model and the model fine-tuned naively on all the data, i.e. 10,000 entries from each of the four domains. [...] four different models fine-tuned on the best two 1/10ths of the dataset |
| Hardware Specification | No | The paper mentions that Mistral 7B Instruct v0.2 is "highly efficient under quantization, so it can be fine-tuned in a realistic time-scale" but does not specify the actual hardware (e.g., GPU models, CPU, memory) used for running the experiments or fine-tuning. |
| Software Dependencies | No | The paper mentions using Python for the MBPP dataset and RegEx for text scraping, but it does not specify any particular software libraries, frameworks, or their version numbers that were used for their implementation or experiments. |
| Experiment Setup | Yes | As such, we settled on Mistral 7B Instruct v0.2 (Jiang et al., 2023) as our main model, and LLaMa 8B Instruct (AI@Meta, 2024) for verification of our results. We created four fine-tuned versions of the base model, each on one of the four different training datasets, and evaluated the performance of each model on all four domain tasks. [...] For this, we create six fine-tuned versions of the base model, each trained on an even split from a pair of domains. [...] Table 3 summarizes the results for p = 4, k = 2, and N = 10, using the same four domains as above. |
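The split-selection loop quoted in the Pseudocode row can be sketched as below. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for "fine-tune the base model on this fold, then average its Complementarity over the validation sets V1, ..., Vp", a step the excerpt does not specify concretely.

```python
import random

def select_splits(data, n_splits, k, score_fn, seed=0):
    """Partition `data` into n_splits random folds, score each fold with
    score_fn (a stand-in for fine-tuning on the fold and averaging
    Complementarity across all validation sets), and return the union of
    the k best-scoring folds as the final fine-tuning set."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    # Stride-slice into n_splits roughly equal folds (step 1 of the pseudocode).
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    # Rank folds by score and keep the top k (step 2).
    best = sorted(folds, key=score_fn, reverse=True)[:k]
    # Combine the selected folds into one dataset for the final
    # fine-tuning pass (step 3).
    return [example for fold in best for example in fold]
```

With n_splits = 10 and k = 2 this reproduces the paper's setting of fine-tuning on "the best two 1/10ths of the dataset", i.e. the final model sees only 20% of the combined training data.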