A Survey on Data Selection for LLM Instruction Tuning
Authors: Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a comprehensive survey on data selection for LLM instruction tuning. ... Section 4 presents the evaluation methods and shows the results of different instruction selection methods. ... To measure the effectiveness of different instruction data selection methods, several evaluation metrics are proposed and can be divided into three categories: winning rate, inner comparison, and external comparison. In this section, we first introduce each distinct evaluation and then compare different data selection methods across multiple benchmarks under these evaluations to analyze which methods achieve better performance. ... Table 2. Performances of different selection methods on winning rate. ... Table 3. Inner comparisons of LLM tuned on subsets with itself tuned on full sets. ... Table 4. External comparisons of LLM tuned on subsets with the other LLM. |
| Researcher Affiliation | Academia | BOLIN ZHANG, Harbin Institute of Technology, China; JIAHAO WANG, Institute of Automation, Chinese Academy of Sciences, China; QIANLONG DU, Institute of Automation, Chinese Academy of Sciences, China; JIAJUN ZHANG, Institute of Automation, Chinese Academy of Sciences, China; ZHIYING TU, Harbin Institute of Technology (Weihai), China; DIANHUI CHU, Harbin Institute of Technology (Weihai), China |
| Pseudocode | No | The paper describes methods such as INSTRUCTMINING, IFD, and ALPAGASUS in prose with mathematical formulas, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | "To facilitate the community, we maintain a paper list¹ and collect commonly used instruction sets for data selection." ¹https://github.com/Bolin97/awesome-instruction-selector (This link points to a curated paper list, not source code for the survey's own methodology.) The paper does not provide concrete access to source code for its own methodology. |
| Open Datasets | Yes | Section 2 Instruction Datasets: "Various instruction tuning datasets (e.g. Self-Instruct and Alpaca), generated by LLMs, offer a wealth of samples without human labor... This section describes the scale and construction procedures of several commonly used instruction tuning datasets: Self-Instruct [27]... Alpaca [25]... WizardLM [31]... LIMA [36]... Dolly V2 [6]... P3 [22]..." |
| Dataset Splits | No | The paper reports results from other works that use subsets such as 'Llama-7B(5%), Llama-7B(full) alpaca' or 'selfinstruct(2k)', indicating that portions of the datasets were used. However, it does not specify how these splits were generated (e.g., random seed, selection methodology, or exact counts/percentages for all data) in enough detail for a reader to reproduce them. The cited splits belong to the reported results of other instruction tuning methods, not to any experimental setup of the survey itself. |
| Hardware Specification | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide details on the hardware used for any computational tasks specific to this survey's methodology. |
| Software Dependencies | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide details on specific software dependencies with version numbers required for its own methodology. |
| Experiment Setup | No | The paper is a survey analyzing existing literature and does not report on new experimental results or provide specific details on experimental setup, hyperparameters, or training configurations for its own methodology. |
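The "winning rate" evaluation quoted in the Research Type row compares a model tuned on a selected subset against a baseline through pairwise judgments on a shared test set. A minimal sketch of how such a score is typically aggregated; the tie-splitting convention below is a common choice and an assumption here, not the survey's exact formula:

```python
# Sketch of the pairwise "winning rate" metric used to compare a model
# tuned on a selected data subset against a baseline (e.g., a model tuned
# on the full set). Assumes a judge has already produced a verdict
# ("win" / "tie" / "lose") for each test prompt, from the subset model's
# point of view. Counting a tie as half a win is one common convention.

def winning_rate(verdicts):
    """Fraction of pairwise comparisons won, with ties counted as half."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    wins = sum(v == "win" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

# Example: 6 wins, 2 ties, 2 losses over 10 prompts -> 0.7
score = winning_rate(["win"] * 6 + ["tie"] * 2 + ["lose"] * 2)
print(score)  # 0.7
```

A score above 0.5 indicates the subset-tuned model is preferred on average, which is how the survey's Tables 2 and 3 frame subset-versus-full-set comparisons.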