Weak to Strong Generalization for Large Language Models with Multi-capabilities
Authors: Yucheng Zhou, Jianbing Shen, Yu Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct extensive experiments to investigate weak to strong generalization for LLMs with multi-capabilities. The experiments reveal that different capabilities tend to remain relatively independent in this generalization, and the effectiveness of weak supervision is significantly impacted by the quality and diversity of the weak datasets. Moreover, the self-bootstrapping of the strong model leads to performance degradation due to its overconfidence and the limited diversity of its generated dataset. To address these issues, we propose a novel training framework using reward models to select valuable data, thereby providing weak supervision for strong model training. In addition, we propose a two-stage training method on both weak and selected datasets to train the strong model. Experimental results demonstrate our method significantly improves the weak to strong generalization with multi-capabilities. |
| Researcher Affiliation | Academia | Yucheng Zhou1, Jianbing Shen1, Yu Cheng2; 1SKL-IOTSC, CIS, University of Macau, 2The Chinese University of Hong Kong; EMAIL EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using equations and prose in sections like 'BACKGROUND AND NOTATION' and 'MULTI-CAPABILITIES WEAK TO STRONG GENERALIZATION WITH REWARD MODELS', but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | To explore the models' performance in weak to strong generalization settings, we employ eight datasets that span various skills, such as GSM8K (Cobbe et al., 2021) for mathematical abilities, MC-TACO (Zhou et al., 2019) for temporal reasoning, SCAN (Lake & Baroni, 2018) for planning ability, CREAK (Onoe et al., 2021) for fact-checking and commonsense reasoning ability, ECQA (Aggarwal et al., 2021) for explainable commonsense reasoning, e-SNLI (Camburu et al., 2018) for logical reasoning ability, OpenBookQA (Mihaylov et al., 2018) for fact reasoning, and SciQ (Welbl et al., 2017) for science-related abilities. |
| Dataset Splits | Yes | Following Burns et al. (2023), each capability's dataset is split into a labeled training set, an unlabeled training set, and a test set to simulate the weak to strong generalization scenario. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 80G GPUs. |
| Software Dependencies | No | The paper mentions the use of Qwen-1.5 models and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | To ensure a fair comparison, we followed the experimental setup from Burns et al. (2023), conducting all experiments with 2 epochs and a batch size of 40. The optimizer used was Adam (Kingma & Ba, 2015) with a learning rate of 1e-5. Weight decay was set at 0.01, and a cosine learning rate decay strategy was employed. During inference, the models utilized a greedy decoding strategy. |
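The training hyperparameters quoted above (2 epochs, batch size 40, Adam at 1e-5 with 0.01 weight decay, cosine learning-rate decay) can be illustrated with a minimal sketch of the resulting schedule. This is not the authors' code; the `cosine_lr` helper and the 4000-example dataset size are illustrative assumptions.

```python
import math

# Hyperparameters as reported in the paper's setup (following Burns et al., 2023).
EPOCHS = 2
BATCH_SIZE = 40
BASE_LR = 1e-5
WEIGHT_DECAY = 0.01  # passed to the Adam optimizer in the actual training run

def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine decay from base_lr toward 0 over total_steps optimizer steps."""
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative: 4000 training examples -> 100 steps/epoch -> 200 total steps.
total_steps = EPOCHS * (4000 // BATCH_SIZE)
schedule = [cosine_lr(s, total_steps) for s in range(total_steps)]
print(f"{schedule[0]:.1e} -> {schedule[-1]:.1e}")  # starts at 1.0e-05, decays to ~0
```

In a real run these values would be handed to Adam and a cosine scheduler (e.g. via a deep-learning framework's `CosineAnnealingLR`-style utility); greedy decoding at inference corresponds to always selecting the highest-probability token with no sampling.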