Your Weak LLM is Secretly a Strong Teacher for Alignment

Authors: Leitian Tao, Yixuan Li

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings demonstrate that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data. Our study indicates a minimized impact of model size on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our understanding of alignment under weak LLM feedback, we conduct a series of qualitative and quantitative analyses, offering novel insights into the quality discrepancies between human feedback and weak LLM feedback.
Researcher Affiliation | Academia | Leitian Tao, Yixuan Li, Department of Computer Sciences, University of Wisconsin-Madison
Pseudocode | No | The paper describes the alignment process in Section 3.1 'ALIGNMENT VIA WEAK LLM FEEDBACK' using prose and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
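Since the paper itself contains no algorithm block, the following is a rough sketch of the workflow reconstructed purely from the quotes elsewhere in this report; the step ordering and names are assumptions, not the authors' exact procedure.

```
Input: labeled preference set Dl, unlabeled set Du (human preference
       labels disregarded), weak teacher model T, student model S
1. Train the weak teacher T on Dl with a preference-optimization
   objective (the paper trains both teacher and student via TRL).
2. For each response pair in Du, query T for a preference label.
3. Train the student S on Dl together with the teacher-labeled Du.
Output: aligned student S
```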
Open Source Code | Yes | Code is publicly available at https://github.com/deeplearning-wisc/weak_llm_teacher.
Open Datasets | Yes | To evaluate the performance, we use the Anthropic HH-RLHF (Helpful and Harmless) dataset (Bai et al., 2022a), which is the most commonly used dataset for alignment... We also evaluate on the Reddit TL;DR summarization dataset from Stiennon et al. (2020), which consists of a Reddit post and several short summaries, judged for quality and informativeness by human evaluators.
Dataset Splits | Yes | The dataset consists of 112,000 training samples and 12,500 test samples and is publicly available... We preprocess the dataset by filtering out samples with token lengths greater than 512, which yields 100,000 training samples and 11,000 test samples. We split the training data into two disjoint sets. The first subset is used as labeled data Dl, and the remainder is used as the unlabeled data Du (by disregarding the preference labels)... For the Reddit TL;DR dataset... 92,000 training samples and 8,000 test samples. We divide the training data into two disjoint sets of 46,000 samples each... We vary the ratio between the labeled dataset size and the full dataset size: 1/16, 1/8, 1/4, 1/2, with the remaining data serving as the unlabeled subset.
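A minimal sketch of the labeled/unlabeled split described above. The function name and the use of a seeded shuffle are my assumptions; the quoted text only specifies disjoint subsets at the stated ratios.

```python
import random

def split_labeled_unlabeled(train, labeled_fraction, seed=0):
    """Split training samples into a labeled subset Dl and an unlabeled pool Du."""
    idx = list(range(len(train)))
    random.Random(seed).shuffle(idx)  # seeded shuffle (assumption, for reproducibility)
    n_labeled = int(len(train) * labeled_fraction)
    labeled = [train[i] for i in idx[:n_labeled]]
    # Preference labels in the remainder are disregarded downstream.
    unlabeled = [train[i] for i in idx[n_labeled:]]
    return labeled, unlabeled

# e.g. the 1/8 ratio on the 100,000-sample filtered HH-RLHF training set
train = list(range(100_000))
dl, du = split_labeled_unlabeled(train, 1 / 8)
print(len(dl), len(du))  # 12500 87500
```

The two subsets are disjoint by construction, matching the paper's setup of varying the labeled fraction over 1/16, 1/8, 1/4, and 1/2.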
Hardware Specification | Yes | Our experiments are conducted on servers equipped with NVIDIA A100 GPUs with 80 GB of VRAM.
Software Dependencies | Yes | The operating system used is Ubuntu 22.04.2 LTS, supported by NVIDIA CUDA Toolkit version 12.1 and cuDNN version 8.9. All experimental implementations are carried out in Python version 3.11.4, utilizing the PyTorch framework version 1.12.1.
Experiment Setup | Yes | For a comprehensive description of the hyper-parameters employed in our experiments, please refer to Appendix A... Based on TRL, we deploy the training of both teacher and student models with the same hyperparameters as shown in Table 7 and Table 8... For the evaluation of the HH-RLHF, we leverage the state-of-the-art gold reward model Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback... To evaluate the responses of the model, we set the temperature to 0.7 and the maximum number of generated tokens to 256.
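The decoding settings quoted above can be expressed as Hugging Face `generate` keyword arguments; the keyword names below are my assumption about how the settings map onto that API, not the paper's own code.

```python
# Response-sampling configuration for evaluation (sketch, not the authors' script).
sampling_kwargs = {
    "do_sample": True,      # temperature sampling rather than greedy decoding
    "temperature": 0.7,     # sampling temperature used for evaluation
    "max_new_tokens": 256,  # cap on the number of generated tokens
}
# Usage (hypothetical model/inputs): model.generate(**inputs, **sampling_kwargs)
```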