Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning
Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning. [...] Comprehensive Experiments. We conduct extensive experiments across multiple tasks, demonstrating that our token cleaning pipeline consistently boosts performance over baselines and validates its practical merits. |
| Researcher Affiliation | Collaboration | (1) University of California, Santa Cruz; (2) Northeastern University; (3) Docta.ai; (4) Hong Kong University of Science and Technology (Guangzhou); (5) Hong Kong Baptist University. |
| Pseudocode | Yes | Algorithm 1 Token Cleaning Pipeline |
| Open Source Code | Yes | Code is available at https://github.com/UCSC-REAL/TokenCleaning. |
| Open Datasets | Yes | Data Pool. We utilize a high-quality data pool with 50k sample size from five popular SFT datasets (300k in total): Flan v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Stanford Alpaca (Taori et al., 2023), Dolly (Databricks, 2023), and WizardLM (Xu et al., 2023). |
| Dataset Splits | Yes | For the self-evolving cleaning strategy, we heuristically divide the data pool into five equally sized subsets (10k samples). [...] Algorithm 1 Token Cleaning Pipeline: 2: Split dataset D̃ into a series of subsets {D̃_0, ..., D̃_T}. |
| Hardware Specification | Yes | All experiments are conducted on eight NVIDIA L40S GPUs. |
| Software Dependencies | No | The paper mentions applying the LoRA technique and using the lm-eval-harness repository, but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Following the experimental setup (Wang et al., 2023), we apply the LoRA technique (Hu et al., 2022) with a rank-size of 64 and a scaling factor of 16. The overall batch size is 48, with the learning rate at 1e-4 as well as 1 training epoch. By default, the maximum input length is 2048. |
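The Dataset Splits and Experiment Setup rows quote two concrete, reproducible details: Algorithm 1 partitions the 50k-sample pool into five equal 10k subsets, and training uses LoRA rank 64, scaling factor 16, batch size 48, learning rate 1e-4, one epoch, and a 2048-token maximum input length. A minimal stdlib sketch of that split is below; the function name and config dict are illustrative, not the authors' released code.

```python
# Hypothetical sketch of the data-pool partition quoted from Algorithm 1:
# divide a 50k-sample pool into equally sized subsets (10k each).

def split_pool(pool, num_subsets):
    """Partition `pool` into `num_subsets` equally sized, contiguous chunks."""
    size = len(pool) // num_subsets
    return [pool[i * size:(i + 1) * size] for i in range(num_subsets)]

# Hyperparameters as quoted in the Experiment Setup row (illustrative dict).
SFT_CONFIG = {
    "lora_rank": 64,
    "lora_alpha": 16,      # LoRA scaling factor
    "batch_size": 48,
    "learning_rate": 1e-4,
    "epochs": 1,
    "max_input_length": 2048,
}

if __name__ == "__main__":
    pool = list(range(50_000))          # stand-in for the 50k-sample data pool
    subsets = split_pool(pool, 5)
    print([len(s) for s in subsets])    # five subsets of 10k samples each
```

This only mirrors the quoted setup; the actual cleaning steps applied to each subset live in the paper's released repository.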