reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exploring Format Consistency for Instruction Tuning

Authors: Shihao Liang, Runchu Tian, Kunlun Zhu, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, Maosong Sun

TMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform analysis across five benchmarks and show that our method successfully mitigates the format inconsistency issue and improves the generalization performance on unseen instructions in both settings. ... Testing-time format transfer results are shown in Table 2 ... Training-time format transfer results are shown in Table 3
Researcher Affiliation	Collaboration	Shihao Liang EMAIL Department of Computer Science Tsinghua University ... Huadong Wang EMAIL ModelBest Inc. ... Xiaojiang Liu EMAIL Apple
Pseudocode	No	The paper describes methods in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code and trained models are publicly available at https://github.com/thunlp/UnifiedInstructionTuning.
Open Datasets	Yes	For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test dataset.
Dataset Splits	Yes	For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test dataset. We evaluate the tasks that do not appear in the training stage. ... For the training-time setting, we use the training set of Ni-v2 together with Flan, CrossFit, and P3 respectively for training and use the test set of Ni-v2 for evaluation.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies	No	The paper mentions using specific models like GPT3.5 and GPT-J, but does not list specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	The hyper-parameters for training include a maximum source data length of 1024, a maximum target data length of 128, a cap of 100 instances per task for both training and evaluation, a batch size of 16 for training, a learning rate of 0.00001, a total of 2 training epochs, linear learning rate scheduling, and a warm-up period consisting of 1000 steps.
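
The hyper-parameters listed under Experiment Setup can be collected into a small configuration object. The following is a minimal sketch, not the authors' training script: the class name, field names, and the `linear_lr` helper are hypothetical. The summary above says only "linear learning rate scheduling" with a 1000-step warm-up, so the exact schedule shape (linear ramp from zero, then linear decay to zero at the final step) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values taken from the Experiment Setup row; field names are illustrative."""
    max_source_length: int = 1024      # maximum source data length
    max_target_length: int = 128       # maximum target data length
    max_instances_per_task: int = 100  # cap for both training and evaluation
    train_batch_size: int = 16
    learning_rate: float = 1e-5        # 0.00001
    num_train_epochs: int = 2
    lr_scheduler: str = "linear"
    warmup_steps: int = 1000


def linear_lr(step: int, total_steps: int, cfg: TrainingConfig) -> float:
    """Assumed schedule: linear warm-up from 0, then linear decay to 0."""
    if step < cfg.warmup_steps:
        # Warm-up phase: scale the peak rate by the fraction of warm-up done.
        return cfg.learning_rate * step / cfg.warmup_steps
    # Decay phase: scale by the fraction of post-warm-up steps remaining.
    remaining = max(total_steps - step, 0)
    decay_span = max(total_steps - cfg.warmup_steps, 1)
    return cfg.learning_rate * remaining / decay_span


if __name__ == "__main__":
    cfg = TrainingConfig()
    total = 5000  # hypothetical step count; the real value depends on dataset size
    for s in (0, 500, 1000, 3000, 5000):
        print(s, linear_lr(s, total, cfg))
```

The total number of optimizer steps is not stated in the summary, so `total` above is a placeholder; in practice it would follow from the number of training instances, the batch size of 16, and the 2 epochs.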