Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu

ICML 2025

Reproducibility (Variable | Result | LLM Response)
Research Type | Experimental | Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning. [...] Comprehensive Experiments. We conduct extensive experiments across multiple tasks, demonstrating that our token cleaning pipeline consistently boosts performance over baselines and validates its practical merits.
Researcher Affiliation | Collaboration | 1 University of California, Santa Cruz; 2 Northeastern University; 3 Docta.ai; 4 Hong Kong University of Science and Technology (Guangzhou); 5 Hong Kong Baptist University.
Pseudocode | Yes | Algorithm 1: Token Cleaning Pipeline.
Open Source Code | Yes | Code is available at https://github.com/UCSC-REAL/TokenCleaning.
Open Datasets | Yes | Data Pool: We utilize a high-quality data pool of 50k samples drawn from five popular SFT datasets (300k in total): Flan v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Stanford Alpaca (Taori et al., 2023), Dolly (Databricks, 2023), and WizardLM (Xu et al., 2023).
Dataset Splits | Yes | For the self-evolving cleaning strategy, we heuristically divide the data pool into five equally sized subsets (10k samples each). [...] Algorithm 1 Token Cleaning Pipeline, step 2: Split dataset D̃ into a series of subsets {D̃_0, ..., D̃_T}.
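The splitting step quoted from Algorithm 1 can be sketched in plain Python. This is a minimal illustration of dividing a pool into equally sized subsets (five 10k subsets from the 50k pool); the function name, shuffling, and seeding are illustrative assumptions, since the review quotes only the split itself.

```python
import random

def split_pool(pool, num_subsets=5, seed=0):
    """Shuffle the data pool and divide it into equally sized subsets,
    mirroring the 5 x 10k split of the 50k pool described above.
    (Name, shuffle, and seed are illustrative, not from the paper.)"""
    items = list(pool)
    random.Random(seed).shuffle(items)
    size = len(items) // num_subsets
    return [items[i * size:(i + 1) * size] for i in range(num_subsets)]

# Toy stand-in: 50 samples instead of the paper's 50k.
subsets = split_pool(range(50), num_subsets=5)
print([len(s) for s in subsets])  # prints [10, 10, 10, 10, 10]
```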
Hardware Specification | Yes | All experiments are conducted on eight NVIDIA L40S GPUs.
Software Dependencies | No | The paper mentions applying the LoRA technique and using the lm-eval-harness repository, but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Following the experimental setup of Wang et al. (2023), we apply the LoRA technique (Hu et al., 2022) with a rank of 64 and a scaling factor of 16. The overall batch size is 48, the learning rate is 1e-4, and training runs for 1 epoch. By default, the maximum input length is 2048.
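The reported hyperparameters can be gathered into one configuration sketch. The paper states only the values; the dictionary keys below are assumed names (loosely following common LoRA-library conventions such as Hugging Face peft), and the effective LoRA scale alpha/rank is derived from standard LoRA, not quoted from the paper.

```python
# Hedged sketch: hyperparameter values quoted above, key names assumed.
sft_config = {
    "lora_rank": 64,          # LoRA rank-size (Hu et al., 2022)
    "lora_alpha": 16,         # LoRA scaling factor
    "batch_size": 48,         # overall (global) batch size
    "learning_rate": 1e-4,
    "num_epochs": 1,
    "max_input_length": 2048,
}

# In standard LoRA, the low-rank update is scaled by alpha / rank.
effective_scale = sft_config["lora_alpha"] / sft_config["lora_rank"]
print(effective_scale)  # 0.25
```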