A Survey on Data Selection for LLM Instruction Tuning
Authors: Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a comprehensive survey on data selection for LLM instruction tuning. ... Section 4 presents the evaluation methods and shows the results of different instruction selection methods. ... To measure the effectiveness of different instruction data selection methods, several evaluation metrics are proposed and can be divided into three categories: winning rate, inner comparison, and external comparison. In this section, we first introduce each distinct evaluation and then compare different data selection methods across multiple benchmarks under these evaluations to analyze which methods achieve better performance. ... Table 2. Performances of different selection methods on winning rate. ... Table 3. Inner comparisons of LLM tuned on subsets with itself tuned on full sets. ... Table 4. External comparisons of LLM tuned on subsets with the other LLM. |
| Researcher Affiliation | Academia | BOLIN ZHANG, Harbin Institute of Technology, China; JIAHAO WANG, Institute of Automation, Chinese Academy of Sciences, China; QIANLONG DU, Institute of Automation, Chinese Academy of Sciences, China; JIAJUN ZHANG, Institute of Automation, Chinese Academy of Sciences, China; ZHIYING TU, Harbin Institute of Technology (Weihai), China; DIANHUI CHU, Harbin Institute of Technology (Weihai), China |
| Pseudocode | No | The paper describes methods such as INSTRUCTMINING, IFD, and ALPAGASUS in prose with mathematical formulas, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | "To facilitate the community, we maintain a paper list¹ and collect commonly used instruction sets for data selection." ¹https://github.com/Bolin97/awesome-instruction-selector (This link points to a curated paper list, not source code for the survey's own methodology.) The paper does not provide concrete access to source code for its own methodology. |
| Open Datasets | Yes | Section 2 Instruction Datasets: "Various instruction tuning datasets (e.g. Self-Instruct and Alpaca), generated by LLMs, offer a wealth of samples without human labor... This section describes the scale and construction procedures of several commonly used instruction tuning datasets: Self-Instruct [27]... Alpaca [25]... WizardLM [31]... LIMA [36]... Dolly V2 [6]... P3 [22]..." |
| Dataset Splits | No | The paper reports results from other works that use subsets such as 'Llama-7B(5%), Llama-7B(full) alpaca' or 'selfinstruct(2k)', indicating that portions of the datasets were used. However, it does not specify how these splits were generated (e.g., random seed, selection methodology, or exact counts/percentages for all data) in enough detail for a reader to reproduce them. The cited splits belong to the reported results of other instruction tuning methods, not to any experimental setup of the survey itself. |
| Hardware Specification | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide details on the hardware used for any computational tasks specific to this survey's methodology. |
| Software Dependencies | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide details on specific software dependencies with version numbers required for its own methodology. |
| Experiment Setup | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide specific details on experimental setup, hyperparameters, or training configurations for its own methodology. |
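The "winning rate" evaluation quoted in the Research Type row compares a model tuned on a selected subset against a baseline through pairwise judgments on a shared test set. A minimal sketch of how such a score is typically aggregated; the tie-splitting convention below is a common choice and an assumption here, not the survey's exact formula:

```python
# Sketch of the pairwise "winning rate" metric used to compare a model
# tuned on a selected data subset against a baseline (e.g., a model tuned
# on the full set). Assumes a judge has already produced a verdict
# ("win" / "tie" / "lose") for each test prompt, from the subset model's
# point of view. Counting a tie as half a win is one common convention.

def winning_rate(verdicts):
    """Fraction of pairwise comparisons won, with ties counted as half."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    wins = sum(v == "win" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

# Example: 6 wins, 2 ties, 2 losses over 10 prompts -> 0.7
score = winning_rate(["win"] * 6 + ["tie"] * 2 + ["lose"] * 2)
print(score)  # 0.7
```

A score above 0.5 indicates the subset-tuned model is preferred on average, which is how the survey's Tables 2 and 3 frame subset-versus-full-set comparisons.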