Automatically Generating Numerous Context-Driven SFT Data for LLMs Across Diverse Granularity

Authors: Shanghaoran Quan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted incorporating both automatic and human evaluations, encompassing four widely-used benchmarks and a test scenario in English and Chinese. The results highlight the significant advantages of AUGCON in producing high diversity, quality, and fidelity SFT data against several state-of-the-art methods.
Researcher Affiliation | Academia | Shanghaoran Quan, School of Computer Science and Engineering, Beihang University, Beijing, Haidian, 100191, China. EMAIL
Pseudocode | Yes | Algorithm 1: Context Split Tree
Open Source Code | Yes | Code: https://github.com/quanshr/AugCon
Open Datasets | Yes | Additionally, automatic evaluations conducted on four popularly used English benchmarks with relevant metrics further highlight the significant advantages our method holds in capturing contextual knowledge when compared to other state-of-the-art context-driven SFT data generation approaches. Specifically, the contributions of our work lie in:
- We propose AUGCON, which can automatically generate multi-granularity context-driven SFT data from the corpus for LLMs at scale with high diversity, quality, and fidelity, providing a solution to a realistic industrial and academic problem worth studying.
- Our ideas of deriving queries via CST, training the scorer with contrastive learning to collaborate with the generation process to refine data, and synergistically integrating self-alignment and self-improvement to obtain high-fidelity responses are novel and may inspire further work.
- Extensive experiments incorporating both automatic and human evaluations, encompassing four widely-used benchmarks and a test scenario in English and Chinese, compared with other state-of-the-art methods, demonstrate the effectiveness and advantages of AUGCON.
- To benefit the community and let others generate high-diversity SFT data on their own corpus without effort, we open-source all of our code, dataset, and fine-tuned model at: https://github.com/quanshr/AugCon.
Dataset Splits | No | The paper mentions testing on test QA pairs from benchmarks and constructing a Daily M test set, but it does not provide specific percentages or counts for training/validation/test splits of its own generated SFT data, or for the Daily M corpus used in training.
Hardware Specification | No | The paper mentions using specific LLMs like Llama3-70B-Instruct and Qwen1.5-32B-Chat as base models for fine-tuning, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train these models.
Software Dependencies | No | The paper mentions using Llama3-70B-Instruct and Qwen1.5-32B-Chat as base models, but does not specify any programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup | Yes | The minimum length threshold λ and the initial context length l act as the lower and upper bounds controlling the granularity distribution of generated questions. Then, we start with an empty set and add one training query at a time, only if the current query has a ROUGE-L precision score below 0.7 against every previously added query. Note that we do not generate all corresponding negative examples for the positive data when training the scorer, but rather randomly select a very small number of samples (e.g., only 500 pairs for each negative type in our implementation) to form the training set DSc. For methods such as Adapt LLM, ETRC, Context Instruct, and our AUGCON, which generate query-response pairs based on context, we adhere to a standard where every 35 Chinese characters derive one query-response pair to ensure a fair comparison. We limit the number of generated entries to the same count because we find that all methods spend much more time on the final fine-tuning process than on the preceding generation.
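The greedy ROUGE-L deduplication described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes whitespace tokenization and computes ROUGE-L precision as LCS length over candidate length; the function names are illustrative.

```python
def rouge_l_precision(candidate: str, reference: str) -> float:
    """ROUGE-L precision: LCS length divided by candidate length (token level)."""
    c, r = candidate.split(), reference.split()
    if not c:
        return 0.0
    # Standard dynamic-programming table for longest common subsequence.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            if c[i - 1] == r[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(c)

def diversity_filter(queries: list[str], threshold: float = 0.7) -> list[str]:
    """Start with an empty set; keep a query only if its ROUGE-L precision
    against every previously kept query stays below the threshold."""
    kept: list[str] = []
    for q in queries:
        if all(rouge_l_precision(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```

Near-duplicate queries that mostly repeat an earlier one are dropped, while queries with little token overlap pass through; a production pipeline would likely use a tuned tokenizer or an existing ROUGE library instead of this toy scorer.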