Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models

Authors: Shirley Anugrah Hayati, Taehee Jung, Tristan Bodding-Long, Sudipta Kar, Abhinav Sethy, Joo-Kyung Kim, Dongyeop Kang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first create a new CoI dataset with our proposed LLM-based compositionality checker, and then evaluate our model's performance in handling (1) traditional single instructions and (2) compositional instructions. Our work is closely related to other instruction-tuning works and compositional studies in NLP, as summarized in Table 1.
Researcher Affiliation | Collaboration | ¹University of Minnesota, ²Amazon, ³Grammarly
Pseudocode | No | The paper describes methods like 'Automatic Dataset Creation Pipeline' and 'Instruction Composition' through textual descriptions and flowcharts (Figure 3), but it does not contain a distinct pseudocode block or algorithm.
Open Source Code | Yes | Code and Datasets: {https://github.com/amazonscience/chain-of-instructions}
Open Datasets | Yes | Code and Datasets: {https://github.com/amazonscience/chain-of-instructions} ... Seed Datasets: We curate a new compositional instruction dataset from an existing single-task instruction dataset: SUPER-NATURALINSTRUCTIONS (SUP-NATINS) (Wang et al. 2022). ... Downstream Task: In addition to CoI test sets, we examine the usefulness of CoI on the downstream task of multilingual summarization using WikiLingua (Ladhak et al. 2020).
Dataset Splits | Yes | In each pair or triplet, we randomly select at most three instances and divide them into training and testing sets. For the longer chains (4, 5), we only use them for testing. Please find Appendix ?? for the detailed statistics. Table 2: Dataset statistics per chain length.
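The split procedure quoted above (at most three instances per pair or triplet, with chains of length 4–5 held out for testing) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the `chains` mapping from a chain of task names to its instance list, and the 1/2 train/test cut, are assumptions:

```python
import random

def split_coi_instances(chains, seed=0):
    """Sketch of the described split: for each chain of length 2-3, sample at
    most three instances and divide them between train and test; chains of
    length 4-5 go to the test set only. The input format is assumed."""
    rng = random.Random(seed)
    train, test = [], []
    for chain, instances in chains.items():
        picked = rng.sample(instances, min(3, len(instances)))
        if len(chain) <= 3:
            # divide the sampled instances between training and testing
            cut = max(1, len(picked) // 2)
            train.extend(picked[:cut])
            test.extend(picked[cut:])
        else:
            # longer chains (4, 5) are used for testing only
            test.extend(picked)
    return train, test
```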
Hardware Specification | No | The paper mentions fine-tuning Alpaca7B and Mistral-7B-Instruct models in the 'Experiment Setup' section, but it does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using GPT-3.5 Turbo for data creation, base models Alpaca7B and Mistral-7B-Instruct, and Sentence-BERT with DistilRoBERTa for embeddings. However, it does not provide specific version numbers for the underlying software libraries (e.g., PyTorch or Transformers) used for implementation.
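The row above notes that Sentence-BERT embeddings are used, but not how they are consumed. A common pattern with sentence embeddings is cosine similarity between instruction vectors; a minimal dependency-free sketch is below (the vectors are toy stand-ins, not real Sentence-BERT outputs, and this comparison step is an assumption, not the paper's documented pipeline):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings" standing in for Sentence-BERT outputs.
emb_a = [0.1, 0.3, 0.5, 0.2]
emb_b = [0.1, 0.3, 0.5, 0.2]
assert abs(cosine_similarity(emb_a, emb_b) - 1.0) < 1e-9  # identical vectors
```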
Experiment Setup | No | The paper outlines the models used (Alpaca7B, Mistral-7B-Instruct), metrics (ROUGE-L, LLM-as-judge), and test sets (CoI test set, BIG-Bench Hard, downstream task). It also mentions 'seven-shot demonstrations' for CoT prompting. However, it defers general fine-tuning details to an appendix ('Fine-tuning details in Appendix') and does not explicitly provide hyperparameters such as learning rate, batch size, or number of epochs in the main text.