RLTHF: Targeted Human Feedback for LLM Alignment

Authors: Yifei Xu, Tusher Chakraborty, Emre Kiciman, Bibek Aryal, Srinagesh Sharma, Songwu Lu, Ranveer Chandra

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human-annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
Researcher Affiliation | Collaboration | ¹Microsoft, ²University of California, Los Angeles. Correspondence to: Yifei Xu <EMAIL>, Tusher Chakraborty <EMAIL>.
Pseudocode | Yes | Find the corresponding pseudocode in Appendix B.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor a direct link to a code repository. It mentions 'AlpacaEval' with a GitHub link, but this is a third-party evaluation tool, not the authors' own implementation.
Open Datasets | Yes | HH-RLHF: Anthropic's helpful-and-harmless human preference dataset (Bai et al., 2022a), which includes 161K training samples. TL;DR: the Reddit TL;DR summarization dataset (Völske et al., 2017) filtered by OpenAI, along with their human preference dataset (Stiennon et al., 2020), which includes 93K training samples.
Dataset Splits | Yes | Sharding: RLTHF is run on a randomly down-sampled 1/4 shard of the full dataset. In each iteration, human annotation is applied to 4% of the given shard. An unseen test set of 4K samples is used for both HH-RLHF and TL;DR.
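The sharding and per-iteration annotation budget described above can be sketched as follows; this is an illustrative reconstruction (function and variable names are hypothetical, not from the authors' code), assuming uniform random down-sampling:

```python
import random

def shard_and_budget(dataset, shard_frac=0.25, annot_frac=0.04, seed=0):
    """Hypothetical sketch: RLTHF runs on a random 1/4 shard of the
    full dataset, and each iteration sends 4% of that shard to
    human annotators."""
    rng = random.Random(seed)
    # Randomly down-sample to a 1/4 shard of the full training set.
    shard = rng.sample(dataset, int(len(dataset) * shard_frac))
    # Human-annotation budget per RLTHF iteration: 4% of the shard.
    per_iter = int(len(shard) * annot_frac)
    return shard, per_iter

# HH-RLHF has 161K training samples (see the Open Datasets row).
shard, per_iter = shard_and_budget(list(range(161_000)))
print(len(shard), per_iter)  # 40250 1610
```

With HH-RLHF's 161K samples, the shard holds 40,250 examples and each iteration annotates 1,610 of them, which is how the paper's 6-7% total human-effort figure accumulates over a few iterations.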
Hardware Specification | Yes | All training is done on a node of 8 NVIDIA A100 GPUs with DeepSpeed.
Software Dependencies | No | The paper mentions 'DeepSpeed' as a framework, and specific LLMs such as 'Qwen2.5-3B' and 'Llama-3.1-8B-Instruct' as models, but does not provide version numbers for any key software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | SFT: full-parameter fine-tuning of the Qwen2.5-3B base model with learning rate 2e-5, warmup ratio 0.2, and batch size 32, trained for 4 epochs. Reward Modeling: the reward model is a LoRA fine-tune of Llama-3.1-8B-Instruct with learning rate 1e-4, warmup ratio 0.1, LoRA rank 32, LoRA alpha 64, and batch size 128, trained for 2 epochs. DPO: DPO is performed on the SFT model with data sanitized by RLTHF, using learning rate 1e-6, warmup ratio 0.1, beta 0.1 (HH-RLHF) and 0.5 (TL;DR), and batch size 64, trained for 4 epochs. Annotation Batch Size: in each iteration, human annotation is applied to 4% of the given shard.
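The hyperparameters reported in the row above can be collected into config dicts for quick reference; the key names below are illustrative (these are not the authors' actual configuration files), but the values are taken verbatim from the paper:

```python
# Hypothetical config dicts mirroring the reported hyperparameters.
SFT_CONFIG = {
    "base_model": "Qwen2.5-3B",
    "tuning": "full-parameter",
    "learning_rate": 2e-5,
    "warmup_ratio": 0.2,
    "batch_size": 32,
    "epochs": 4,
}

REWARD_MODEL_CONFIG = {
    "base_model": "Llama-3.1-8B-Instruct",
    "tuning": "LoRA",
    "learning_rate": 1e-4,
    "warmup_ratio": 0.1,
    "lora_rank": 32,
    "lora_alpha": 64,
    "batch_size": 128,
    "epochs": 2,
}

DPO_CONFIG = {
    "learning_rate": 1e-6,
    "warmup_ratio": 0.1,
    # DPO's beta differs per dataset in the paper.
    "beta": {"HH-RLHF": 0.1, "TL;DR": 0.5},
    "batch_size": 64,
    "epochs": 4,
}
```

Note the three stages use different model scales and tuning regimes: full-parameter SFT on a 3B model, but parameter-efficient LoRA for the 8B reward model, which keeps all training within a single 8-GPU A100 node.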