Exploring Format Consistency for Instruction Tuning
Authors: Shihao Liang, Runchu Tian, Kunlun Zhu, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, Maosong Sun
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform analysis across five benchmarks and show that our method successfully mitigates the format inconsistency issue and improves the generalization performance on unseen instructions in both settings. ... Testing-time format transfer results are shown in Table 2 ... Training-time format transfer results are shown in Table 3 |
| Researcher Affiliation | Collaboration | Shihao Liang EMAIL Department of Computer Science Tsinghua University ... Huadong Wang EMAIL ModelBest Inc. ... Xiaojiang Liu EMAIL Apple |
| Pseudocode | No | The paper describes methods in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and trained models are publicly available at https://github.com/thunlp/UnifiedInstructionTuning. |
| Open Datasets | Yes | For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test datasets. |
| Dataset Splits | Yes | For the testing-time setting, we select Ni-v2 (Wang et al., 2022b) as the training dataset and use DiversePrompt (Honovich et al., 2022b), Flan (Wei et al., 2021), CrossFit (Ye et al., 2021a), and PromptSource (Bach et al., 2022) as the test datasets. We evaluate the tasks that do not appear in the training stage. ... For the training-time setting, we use the training set of Ni-v2 together with Flan, CrossFit, and P3 respectively for training and use the test set of Ni-v2 for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models such as GPT-3.5 and GPT-J, but does not list versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The hyper-parameters for training include a maximum source length of 1024, a maximum target length of 128, a cap of 100 instances per task for both training and evaluation, a training batch size of 16, a learning rate of 1e-5, a total of 2 training epochs, linear learning rate scheduling, and a warm-up period of 1000 steps. |
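
For reference, the reported hyper-parameters can be collected into a single configuration object. This is a minimal sketch, not the authors' code: the key names (e.g., `max_source_length`, `warmup_steps`) follow common Hugging Face `Trainer`-style conventions and are assumptions; only the values come from the paper.

```python
# Hypothetical configuration mirroring the hyper-parameters reported in the
# paper. Key names are assumed (Trainer-style); values are from the paper.
training_config = {
    "max_source_length": 1024,      # maximum source (input) length
    "max_target_length": 128,       # maximum target (output) length
    "max_instances_per_task": 100,  # cap for both training and evaluation
    "train_batch_size": 16,
    "learning_rate": 1e-5,
    "num_train_epochs": 2,
    "lr_scheduler_type": "linear",
    "warmup_steps": 1000,
}

if __name__ == "__main__":
    for key, value in training_config.items():
        print(f"{key}: {value}")
```

A dictionary like this could be passed to a training-argument builder or serialized to JSON when attempting a reproduction, since the paper gives no hardware or dependency details to constrain the setup further.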