Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory
Authors: Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental setup validates our theoretical results. In this section, we present our experiments conducted on different real-world tasks and datasets in order to illustrate our theoretical findings presented in the previous section. |
| Researcher Affiliation | Collaboration | ¹Technology Innovation Institute, Abu Dhabi, UAE; ²Simons Institute, Berkeley, USA |
| Pseudocode | No | The paper describes methods and mathematical derivations (e.g., Theorem 4.2, Corollary 4.3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. The appendices contain useful lemmas and random matrix analysis derivations, not pseudocode for the proposed approach. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using various third-party LLMs (e.g., Falcon 2-11B Instruct, Alpaca3-70B, Llama-Guard-3-8B) but not its own implementation code. |
| Open Datasets | Yes | We use the Amazon Reviews datasets (Blitzer et al. (2007)). We also conducted experiments on the MNIST (Le Cun & Cortes (2010)) dataset. We finetune the Falcon 2-11B Instruct model (Malartic et al., 2024) on n = 5000 human data from Anthropic's HH-RLHF dataset, which correspond to real data, while synthetic data are extracted from the PKU safe RLHF dataset. For the evaluation, we use the ALERT dataset (Tedeschi et al. (2024)). |
| Dataset Splits | Yes | We finetune the Falcon 2-11B Instruct model (Malartic et al., 2024) on n = 5000 human data from Anthropic's HH-RLHF dataset... We increase the amount of synthetic data by injecting gradually five batches of 7000 samples per batch... The number of real data samples used is n = 800... training on a mix of real (n = 500)... The test accuracy is computed over the testing dataset extracted from Ji et al. (2024), with 2.8k Q&A samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications. It only describes the software setup and hyperparameters in the appendices. |
| Software Dependencies | No | The paper mentions using 'the standard scaler from sklearn (Pedregosa et al., 2011)' and discusses various LLMs by name (e.g., Falcon 2-11B Instruct, Llama-Guard-3-8B). Appendices E and F list 'LoRA Arguments', 'Trainer Arguments', 'optim paged adamw 32bit', 'lr scheduler type cosine' and 'use flash attention 2 true' with hyperparameters, but do not provide specific version numbers for software libraries like scikit-learn, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 1: Implementation Details for the safety LLM alignment with IPO, including LoRA Arguments (lora r 128, lora alpha 128, lora dropout 0.05), Trainer Arguments (bf16 true, beta 0.01, eval steps 100, gradient accumulation steps 4, learning rate 5.0e-6, log level info, logging steps 10, lr scheduler type cosine, max length 1024, max prompt length 512, num train epochs 1, optim paged adamw 32bit, per device train batch size 4, per device eval batch size 8, seed 42, warmup ratio 0.1, Label smoothing 0.001). Similar detailed tables (Table 2 and Table 3) are provided for Fine-tuning Llama3.1-8B-Instruct and Gemma-2-2B-it. |
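The Table 1 hyperparameters quoted above can be collected into a single configuration sketch. The two-dictionary layout, key names, and the derived effective-batch computation below are illustrative assumptions; only the values themselves come from the paper's Table 1.

```python
# Configuration sketch of the Table 1 settings for safety LLM alignment
# with IPO. Key names and grouping are illustrative, not the paper's code;
# the values are those reported in Table 1.

lora_args = {
    "lora_r": 128,         # LoRA rank
    "lora_alpha": 128,     # LoRA scaling factor
    "lora_dropout": 0.05,
}

trainer_args = {
    "bf16": True,
    "beta": 0.01,                       # IPO regularization strength
    "eval_steps": 100,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5.0e-6,
    "log_level": "info",
    "logging_steps": 10,
    "lr_scheduler_type": "cosine",
    "max_length": 1024,
    "max_prompt_length": 512,
    "num_train_epochs": 1,
    "optim": "paged_adamw_32bit",
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.001,
}

# Effective training batch size per device per optimizer step
# (batch size x gradient accumulation steps):
effective_batch = (
    trainer_args["per_device_train_batch_size"]
    * trainer_args["gradient_accumulation_steps"]
)
print(effective_batch)  # 16
```

Tables 2 and 3 in the paper report analogous settings for Llama3.1-8B-Instruct and Gemma-2-2B-it, so the same structure would apply with their respective values.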