Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory
Authors: Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental setup validates our theoretical results. In this section, we present our experiments conducted on different real-world tasks and datasets in order to illustrate our theoretical findings presented in the previous section. |
| Researcher Affiliation | Collaboration | ¹Technology Innovation Institute, Abu Dhabi, UAE; ²Simons Institute, Berkeley, USA |
| Pseudocode | No | The paper describes methods and mathematical derivations (e.g., Theorem 4.2, Corollary 4.3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. The appendices contain useful lemmas and random matrix analysis derivations, not pseudocode for the proposed approach. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using various third-party LLMs (e.g., Falcon 2-11B Instruct, Alpaca3-70B, Llama-Guard-3-8B) but not its own implementation code. |
| Open Datasets | Yes | We use the Amazon Reviews datasets (Blitzer et al. (2007)). We also conducted experiments on the MNIST (Le Cun & Cortes (2010)) dataset. We finetune the Falcon 2-11B Instruct model (Malartic et al., 2024) on n = 5000 human data from Anthropic's HH-RLHF dataset, which correspond to real data, while synthetic data are extracted from the PKU safe RLHF dataset. For the evaluation, we use the ALERT dataset (Tedeschi et al. (2024)). |
| Dataset Splits | Yes | We finetune the Falcon 2-11B Instruct model (Malartic et al., 2024) on n = 5000 human data from Anthropic's HH-RLHF dataset... We increase the amount of synthetic data by injecting gradually five batches of 7000 samples per batch... The number of real data samples used is n = 800... training on a mix of real (n = 500)... The test accuracy is computed over the testing dataset extracted from Ji et al. (2024), with 2.8k Q&A samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications. It only describes the software setup and hyperparameters in the appendices. |
| Software Dependencies | No | The paper mentions using 'the standard scaler from sklearn (Pedregosa et al., 2011)' and discusses various LLMs by name (e.g., Falcon 2-11B Instruct, Llama-Guard-3-8B). Appendices E and F list 'LoRA Arguments', 'Trainer Arguments', 'optim paged adamw 32bit', 'lr scheduler type cosine' and 'use flash attention 2 true' with hyperparameters, but do not provide specific version numbers for software libraries like scikit-learn, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 1: Implementation Details for the safety LLM alignment with IPO, including LoRA Arguments (lora r 128, lora alpha 128, lora dropout 0.05), Trainer Arguments (bf16 true, beta 0.01, eval steps 100, gradient accumulation steps 4, learning rate 5.0e-6, log level info, logging steps 10, lr scheduler type cosine, max length 1024, max prompt length 512, num train epochs 1, optim paged adamw 32bit, per device train batch size 4, per device eval batch size 8, seed 42, warmup ratio 0.1, Label smoothing 0.001). Similar detailed tables (Table 2 and Table 3) are provided for Fine-tuning Llama3.1-8B-Instruct and Gemma-2-2B-it. |
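The Table 1 hyperparameters quoted above can be collected into a single configuration sketch. The two-dictionary layout, key names, and the derived effective-batch computation below are illustrative assumptions; only the values themselves come from the paper's Table 1.

```python
# Configuration sketch of the Table 1 settings for safety LLM alignment
# with IPO. Key names and grouping are illustrative, not the paper's code;
# the values are those reported in Table 1.

lora_args = {
    "lora_r": 128,         # LoRA rank
    "lora_alpha": 128,     # LoRA scaling factor
    "lora_dropout": 0.05,
}

trainer_args = {
    "bf16": True,
    "beta": 0.01,                       # IPO regularization strength
    "eval_steps": 100,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5.0e-6,
    "log_level": "info",
    "logging_steps": 10,
    "lr_scheduler_type": "cosine",
    "max_length": 1024,
    "max_prompt_length": 512,
    "num_train_epochs": 1,
    "optim": "paged_adamw_32bit",
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.001,
}

# Effective training batch size per device per optimizer step
# (batch size x gradient accumulation steps):
effective_batch = (
    trainer_args["per_device_train_batch_size"]
    * trainer_args["gradient_accumulation_steps"]
)
print(effective_batch)  # 16
```

Tables 2 and 3 in the paper report analogous settings for Llama3.1-8B-Instruct and Gemma-2-2B-it, so the same structure would apply with their respective values.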