Exploring Format Consistency for Instruction Tuning

Authors: Shihao Liang, Runchu Tian, Kunlun Zhu, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, Maosong Sun

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform analysis across five benchmarks and show that our method successfully mitigates the format inconsistency issue and improves the generalization performance on unseen instructions in both settings. ... Testing-time format transfer results are shown in Table 2 ... Training-time format transfer results are shown in Table 3.
Researcher Affiliation | Collaboration | Shihao Liang EMAIL Department of Computer Science, Tsinghua University ... Huadong Wang EMAIL ModelBest Inc. ... Xiaojiang Liu EMAIL Apple
Pseudocode | No | The paper describes methods in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and trained models are publicly available at https://github.com/thunlp/UnifiedInstructionTuning.
Open Datasets | Yes | For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test dataset.
Dataset Splits | Yes | For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test dataset. We evaluate the tasks that do not appear in the training stage. ... For the training-time setting, we use the training set of Ni-v2 together with Flan, CrossFit, and P3 respectively for training and use the test set of Ni-v2 for evaluation.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper mentions using specific models like GPT3.5 and GPT-J, but does not list specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | The hyper-parameters for training include a maximum source data length of 1024, a maximum target data length of 128, a cap of 100 instances per task for both training and evaluation, a batch size of 16 for training, a learning rate of 0.00001, a total of 2 training epochs, linear learning rate scheduling, and a warm-up period consisting of 1000 steps.
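
The reported hyper-parameters can be gathered into a small configuration object, as in the sketch below. This is an illustration, not the authors' code: the class name TrainConfig, the field names, the function linear_lr, and the total_steps argument are all hypothetical. The summary says only "linear learning rate scheduling" with 1000 warm-up steps, so the decay to zero after warm-up is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values taken from the Experiment Setup row; field names are hypothetical."""
    max_source_length: int = 1024      # maximum source data length
    max_target_length: int = 128       # maximum target data length
    max_instances_per_task: int = 100  # cap for both training and evaluation
    train_batch_size: int = 16
    learning_rate: float = 1e-5        # reported as 0.00001
    num_epochs: int = 2
    warmup_steps: int = 1000


def linear_lr(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Linear warm-up to the peak rate, then linear decay.

    Assumption: the rate decays linearly to zero at total_steps;
    the source does not state the end point of the schedule.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * step / cfg.warmup_steps
    remaining = max(total_steps - step, 0)
    return cfg.learning_rate * remaining / max(total_steps - cfg.warmup_steps, 1)
```

For example, with a hypothetical run of 5000 optimizer steps, linear_lr(500, 5000, TrainConfig()) is halfway through warm-up and returns half the peak rate. The real number of steps depends on the dataset size, which the summary does not give.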