Bayesian WeakS-to-Strong from Text Classification to Generation

Authors: Ziyun Cui, Ziyang Zhang, Guangzhi Sun, Wen Wu, Chao Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed Bayesian WeakS-to-Strong approach was first evaluated on a classification task. Table 1 shows the respective performance of the strong model and the weak models trained using ground-truth labels... Results of Weak(S)-to-Strong approaches are shown in Table 2. (Section 6: RESULTS)
Researcher Affiliation | Academia | (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Shanghai Artificial Intelligence Laboratory, Shanghai, China; (3) Department of Engineering, University of Cambridge, Cambridge, UK
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical equations and text, but no distinct algorithm boxes or pseudocode sections are present.
Open Source Code | Yes | Code is available at https://github.com/cuiziyun/Bayesian WS2S.
Open Datasets | Yes | The SciQ dataset (Welbl et al., 2017) is used... the SLURP dataset (Bastianelli et al., 2020) was used... additional experiments were conducted on another dataset, Cosmos QA (Huang et al., 2019).
Dataset Splits | Yes | For the classification task... 5k data samples were extracted for training weak models and another 5k samples were reserved for generating weak labels to train the strong model. The standard test set, which contains 1k data samples, was used for testing. For slot filling... 2k utterances from the train split were extracted for training the weak models, and another 2k utterances were reserved for generating weak labels and training the strong model. The performance of both weak and strong models is reported on the standard SLURP test set.
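The classification-task split described above (5k for weak-model training, a disjoint 5k for weak-label generation, and the standard 1k test set) can be sketched as follows. This is a hypothetical illustration with placeholder indices, not the authors' preprocessing code; the pool size of 20k is an assumption standing in for the SciQ train split.

```python
import random

# Hypothetical sketch of the classification-task data split: indices are
# placeholders for SciQ training examples (pool size assumed, not from the paper).
rng = random.Random(0)
train_ids = list(range(20_000))
rng.shuffle(train_ids)

weak_train = train_ids[:5_000]             # finetune the weak models on ground truth
weak_label_pool = train_ids[5_000:10_000]  # weak models label these to train the strong model
# The standard 1k-sample test set is kept untouched for evaluation.
```

Keeping the two 5k pools disjoint matters: the strong model must learn from weak labels on data the weak models never saw during their own finetuning.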
Hardware Specification | Yes | All models were trained on NVIDIA A800 GPUs using the bfloat16 data type.
Software Dependencies | No | The paper mentions optimizers such as Adam and AdamW and the bfloat16 data type, but does not provide version numbers for the software libraries (e.g., Python, PyTorch, TensorFlow, CUDA) used in the implementation.
Experiment Setup | Yes | For the classification tasks, the Adam optimizer was used with a cosine learning rate scheduler and no warm-up period. The batch size was set to 32, with a mini-batch size of 1. The weak models were finetuned on the ground-truth labels with an initial learning rate of 5×10⁻⁵, while the strong models were trained with a starting learning rate of 1×10⁻⁵... The Weak(S)-to-Strong training was run for two epochs. For generation tasks, the AdamW optimizer was used with a linear learning rate scheduler, also with no warm-up. The initial learning rates were set at 4×10⁻⁵ for GPT2-Large and Pythia-1.4B, and 8×10⁻⁵ for OPT-1.3B, with a batch size of 8 (mini-batch size of 4). These models were trained for 15 epochs. The strong model was trained with a batch size of 2 (mini-batch size of 1) and an initial learning rate of 1×10⁻⁵, evaluated at the end of two epochs. For DPO, the initial learning rate was set to 5×10⁻⁷ for two epochs, with the cDPO hyperparameter β set to 2.0 and label smoothing ϵ set to 0.1.
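The classification-task hyperparameters above can be collected into a config, with the cosine schedule (no warm-up) written out explicitly. This is a generic sketch assuming the standard cosine decay from the base learning rate to zero; the `cosine_lr` helper and the step count of 100 are illustrative, not taken from the authors' code.

```python
import math

# Hyperparameters transcribed from the setup described above (classification task).
CLS_CONFIG = {
    "optimizer": "Adam",
    "scheduler": "cosine",
    "warmup_steps": 0,
    "batch_size": 32,
    "mini_batch_size": 1,
    "lr_weak": 5e-5,    # weak models, finetuned on ground-truth labels
    "lr_strong": 1e-5,  # strong model, trained on weak labels
    "epochs": 2,
}

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr to 0 with no warm-up (assumed schedule shape)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Strong-model learning rate at the start, midpoint, and end of training
# (100 total steps is a placeholder, not a value from the paper).
lrs = [cosine_lr(s, 100, CLS_CONFIG["lr_strong"]) for s in (0, 50, 100)]
```

With this schedule the strong model starts at 1×10⁻⁵, passes 5×10⁻⁶ at the midpoint, and decays to zero by the final step.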