Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models
Authors: Shirley Anugrah Hayati, Taehee Jung, Tristan Bodding-Long, Sudipta Kar, Abhinav Sethy, Joo-Kyung Kim, Dongyeop Kang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first create a new CoI dataset with our proposed LLM-based compositionality checker, and then evaluate our model's performance in handling (1) traditional single instructions and (2) compositional instructions. Our work is closely related to other instruction-tuning works and compositional studies in NLP, as summarized in Table 1. |
| Researcher Affiliation | Collaboration | 1University of Minnesota 2Amazon 3Grammarly |
| Pseudocode | No | The paper describes methods like 'Automatic Dataset Creation Pipeline' and 'Instruction Composition' through textual descriptions and flowcharts (Figure 3), but it does not contain a distinct pseudocode block or algorithm. |
| Open Source Code | Yes | Code and Datasets: https://github.com/amazonscience/chain-of-instructions |
| Open Datasets | Yes | Code and Datasets: https://github.com/amazonscience/chain-of-instructions ... Seed Datasets: We curate a new compositional instruction dataset from an existing single-task instruction dataset: SUPER-NATURALINSTRUCTIONS (SUP-NATINS) (Wang et al. 2022). ... Downstream Task: In addition to CoI test sets, we examine the usefulness of CoI on the downstream task of multilingual summarization using WikiLingua (Ladhak et al. 2020) |
| Dataset Splits | Yes | In each pair or triplet, we randomly select at most three instances and divide them into training and testing sets. For the longer chains (4, 5), we only use them for testing. Please find Appendix ?? for the detailed statistics. Table 2: Dataset statistics per chain length. |
| Hardware Specification | No | The paper mentions fine-tuning Alpaca7B and Mistral-7B-Instruct models in the 'Experiment Setup' section, but it does not specify any hardware details like GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using GPT-3.5 Turbo for data creation and base models like Alpaca7B and Mistral-7B-Instruct, as well as Sentence-BERT with DistilRoBERTa for embeddings. However, it does not provide specific version numbers for the underlying software libraries (e.g., PyTorch, Transformers library versions) used for implementation. |
| Experiment Setup | No | The paper outlines the models used (Alpaca7B, Mistral-7B-Instruct), metrics (ROUGE-L, LLM-as-judge), and test sets (CoI test set, BIG-Bench Hard, Downstream Task). It also mentions 'seven-shot demonstrations' for CoT prompting. However, it defers general fine-tuning details to an appendix ('Fine-tuning details in Appendix') and does not explicitly provide hyperparameters such as learning rate, batch size, or number of epochs in the main text. |
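The Experiment Setup row names ROUGE-L as the paper's automatic metric. Since the paper does not specify its scoring implementation, the following is a minimal pure-Python sketch of sentence-level ROUGE-L F1 (longest-common-subsequence based) that a reproduction could use as a sanity check; the function names and whitespace tokenization here are illustrative assumptions, not the authors' code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 with simple whitespace tokenization."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Example: LCS("the cat sat on the mat", "the cat on mat") = 4 tokens,
# so precision = 4/4, recall = 4/6, F1 = 0.8.
print(rouge_l_f1("the cat sat on the mat", "the cat on mat"))  # 0.8
```

In practice, reproductions of ROUGE-based results typically rely on an established package (e.g., Google's `rouge-score`) rather than a hand-rolled scorer, since stemming and tokenization choices shift the numbers.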