Weak to Strong Generalization for Large Language Models with Multi-capabilities
Authors: Yucheng Zhou, Jianbing Shen, Yu Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct extensive experiments to investigate weak to strong generalization for LLMs with multi-capabilities. The experiments reveal that different capabilities tend to remain relatively independent in this generalization, and the effectiveness of weak supervision is significantly impacted by the quality and diversity of the weak datasets. Moreover, the self-bootstrapping of the strong model leads to performance degradation due to its overconfidence and the limited diversity of its generated dataset. To address these issues, we propose a novel training framework using reward models to select valuable data, thereby providing weak supervision for strong model training. In addition, we propose a two-stage training method on both weak and selected datasets to train the strong model. Experimental results demonstrate our method significantly improves the weak to strong generalization with multi-capabilities. |
| Researcher Affiliation | Academia | Yucheng Zhou1, Jianbing Shen1, Yu Cheng2; 1SKL-IOTSC, CIS, University of Macau, 2The Chinese University of Hong Kong; EMAIL EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using equations and prose in sections like 'BACKGROUND AND NOTATION' and 'MULTI-CAPABILITIES WEAK TO STRONG GENERALIZATION WITH REWARD MODELS', but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | To explore the models' performance in weak to strong generalization settings, we employ eight datasets that span various skills, such as GSM8K (Cobbe et al., 2021) for mathematical abilities, MC-TACO (Zhou et al., 2019) for temporal reasoning, SCAN (Lake & Baroni, 2018) for planning ability, CREAK (Onoe et al., 2021) for fact-checking and commonsense reasoning ability, ECQA (Aggarwal et al., 2021) for explainable commonsense reasoning, e-SNLI (Camburu et al., 2018) for logical reasoning ability, OpenBookQA (Mihaylov et al., 2018) for fact reasoning, and SciQ (Welbl et al., 2017) for science-related abilities. |
| Dataset Splits | Yes | Following Burns et al. (2023), each capability's dataset is split into a labeled training set, an unlabeled training set, and a test set to simulate the weak to strong generalization scenario. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 80G GPUs. |
| Software Dependencies | No | The paper mentions the use of Qwen-1.5 models and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | To ensure a fair comparison, we followed the experimental setup from Burns et al. (2023), conducting all experiments with 2 epochs and a batch size of 40. The optimizer used was Adam (Kingma & Ba, 2015) with a learning rate of 1e-5. Weight decay was set at 0.01, and a cosine learning rate decay strategy was employed. During inference, the models utilized a greedy decoding strategy. |
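The training hyperparameters quoted above (2 epochs, batch size 40, Adam at 1e-5 with 0.01 weight decay, cosine learning-rate decay) can be illustrated with a minimal sketch of the resulting schedule. This is not the authors' code; the `cosine_lr` helper and the 4000-example dataset size are illustrative assumptions.

```python
import math

# Hyperparameters as reported in the paper's setup (following Burns et al., 2023).
EPOCHS = 2
BATCH_SIZE = 40
BASE_LR = 1e-5
WEIGHT_DECAY = 0.01  # passed to the Adam optimizer in the actual training run

def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine decay from base_lr toward 0 over total_steps optimizer steps."""
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative: 4000 training examples -> 100 steps/epoch -> 200 total steps.
total_steps = EPOCHS * (4000 // BATCH_SIZE)
schedule = [cosine_lr(s, total_steps) for s in range(total_steps)]
print(f"{schedule[0]:.1e} -> {schedule[-1]:.1e}")  # starts at 1.0e-05, decays to ~0
```

In a real run these values would be handed to Adam and a cosine scheduler (e.g. via a deep-learning framework's `CosineAnnealingLR`-style utility); greedy decoding at inference corresponds to always selecting the highest-probability token with no sampling.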