Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu

ICML 2025

Reproducibility (Variable | Result | LLM Response)
Research Type | Experimental | Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning. [...] Comprehensive Experiments. We conduct extensive experiments across multiple tasks, demonstrating that our token cleaning pipeline consistently boosts performance over baselines and validates its practical merits.
Researcher Affiliation | Collaboration | 1 University of California, Santa Cruz; 2 Northeastern University; 3 Docta.ai; 4 Hong Kong University of Science and Technology (Guangzhou); 5 Hong Kong Baptist University.
Pseudocode | Yes | Algorithm 1: Token Cleaning Pipeline.
Open Source Code | Yes | Code is available at https://github.com/UCSC-REAL/TokenCleaning.
Open Datasets | Yes | Data Pool: We utilize a high-quality data pool of 50k samples drawn from five popular SFT datasets (300k in total): Flan v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Stanford Alpaca (Taori et al., 2023), Dolly (Databricks, 2023), and WizardLM (Xu et al., 2023).
Dataset Splits | Yes | For the self-evolving cleaning strategy, we heuristically divide the data pool into five equally sized subsets (10k samples each). [...] Algorithm 1 Token Cleaning Pipeline, step 2: Split dataset D̃ into a series of subsets {D̃_0, ..., D̃_T}.
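The splitting step quoted from Algorithm 1 can be sketched in plain Python. This is a minimal illustration of dividing a pool into equally sized subsets (five 10k subsets from the 50k pool); the function name, shuffling, and seeding are illustrative assumptions, since the review quotes only the split itself.

```python
import random

def split_pool(pool, num_subsets=5, seed=0):
    """Shuffle the data pool and divide it into equally sized subsets,
    mirroring the 5 x 10k split of the 50k pool described above.
    (Name, shuffle, and seed are illustrative, not from the paper.)"""
    items = list(pool)
    random.Random(seed).shuffle(items)
    size = len(items) // num_subsets
    return [items[i * size:(i + 1) * size] for i in range(num_subsets)]

# Toy stand-in: 50 samples instead of the paper's 50k.
subsets = split_pool(range(50), num_subsets=5)
print([len(s) for s in subsets])  # prints [10, 10, 10, 10, 10]
```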
Hardware Specification | Yes | All experiments are conducted on eight NVIDIA L40S GPUs.
Software Dependencies | No | The paper mentions applying the LoRA technique and using the lm-eval-harness repository, but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Following the experimental setup of Wang et al. (2023), we apply the LoRA technique (Hu et al., 2022) with a rank of 64 and a scaling factor of 16. The overall batch size is 48, the learning rate is 1e-4, and training runs for 1 epoch. By default, the maximum input length is 2048.
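The reported hyperparameters can be gathered into one configuration sketch. The paper states only the values; the dictionary keys below are assumed names (loosely following common LoRA-library conventions such as Hugging Face peft), and the effective LoRA scale alpha/rank is derived from standard LoRA, not quoted from the paper.

```python
# Hedged sketch: hyperparameter values quoted above, key names assumed.
sft_config = {
    "lora_rank": 64,          # LoRA rank-size (Hu et al., 2022)
    "lora_alpha": 16,         # LoRA scaling factor
    "batch_size": 48,         # overall (global) batch size
    "learning_rate": 1e-4,
    "num_epochs": 1,
    "max_input_length": 2048,
}

# In standard LoRA, the low-rank update is scaled by alpha / rank.
effective_scale = sft_config["lora_alpha"] / sft_config["lora_rank"]
print(effective_scale)  # 0.25
```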