Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models

Authors: Shirley Anugrah Hayati, Taehee Jung, Tristan Bodding-Long, Sudipta Kar, Abhinav Sethy, Joo-Kyung Kim, Dongyeop Kang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first create a new CoI dataset with our proposed LLM-based compositionality checker, and then evaluate our model's performance in handling (1) traditional single instructions and (2) compositional instructions. Our work is closely related to other instruction-tuning works and compositional studies in NLP, as summarized in Table 1.
Researcher Affiliation | Collaboration | ¹University of Minnesota, ²Amazon, ³Grammarly
Pseudocode | No | The paper describes methods like 'Automatic Dataset Creation Pipeline' and 'Instruction Composition' through textual descriptions and flowcharts (Figure 3), but it does not contain a distinct pseudocode block or algorithm.
Open Source Code | Yes | Code and Datasets: {https://github.com/amazonscience/chain-of-instructions}
Open Datasets | Yes | Code and Datasets: {https://github.com/amazonscience/chain-of-instructions} ... Seed Datasets: We curate a new compositional instruction dataset from an existing single-task instruction dataset: SUPER-NATURALINSTRUCTIONS (SUP-NATINS) (Wang et al. 2022). ... Downstream Task: In addition to CoI test sets, we examine the usefulness of CoI on the downstream task of multilingual summarization using WikiLingua (Ladhak et al. 2020).
Dataset Splits | Yes | In each pair or triplet, we randomly select at most three instances and divide them into training and testing sets. For the longer chains (4, 5), we only use them for testing. Please find Appendix ?? for the detailed statistics. Table 2: Dataset statistics per chain length.
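The split procedure quoted above (at most three instances per pair or triplet, with chains of length 4–5 held out for testing) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the `chains` mapping from a chain of task names to its instance list, and the 1/2 train/test cut, are assumptions:

```python
import random

def split_coi_instances(chains, seed=0):
    """Sketch of the described split: for each chain of length 2-3, sample at
    most three instances and divide them between train and test; chains of
    length 4-5 go to the test set only. The input format is assumed."""
    rng = random.Random(seed)
    train, test = [], []
    for chain, instances in chains.items():
        picked = rng.sample(instances, min(3, len(instances)))
        if len(chain) <= 3:
            # divide the sampled instances between training and testing
            cut = max(1, len(picked) // 2)
            train.extend(picked[:cut])
            test.extend(picked[cut:])
        else:
            # longer chains (4, 5) are used for testing only
            test.extend(picked)
    return train, test
```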
Hardware Specification | No | The paper mentions fine-tuning Alpaca7B and Mistral-7B-Instruct models in the 'Experiment Setup' section, but it does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using GPT-3.5 Turbo for data creation, base models Alpaca7B and Mistral-7B-Instruct, and Sentence-BERT with DistilRoBERTa for embeddings. However, it does not provide specific version numbers for the underlying software libraries (e.g., PyTorch or Transformers) used for implementation.
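The row above notes that Sentence-BERT embeddings are used, but not how they are consumed. A common pattern with sentence embeddings is cosine similarity between instruction vectors; a minimal dependency-free sketch is below (the vectors are toy stand-ins, not real Sentence-BERT outputs, and this comparison step is an assumption, not the paper's documented pipeline):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings" standing in for Sentence-BERT outputs.
emb_a = [0.1, 0.3, 0.5, 0.2]
emb_b = [0.1, 0.3, 0.5, 0.2]
assert abs(cosine_similarity(emb_a, emb_b) - 1.0) < 1e-9  # identical vectors
```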
Experiment Setup | No | The paper outlines the models used (Alpaca7B, Mistral-7B-Instruct), metrics (ROUGE-L, LLM-as-judge), and test sets (CoI test set, BIG-Bench Hard, downstream task). It also mentions 'seven-shot demonstrations' for CoT prompting. However, it defers general fine-tuning details to an appendix ('Fine-tuning details in Appendix') and does not explicitly provide hyperparameters such as learning rate, batch size, or number of epochs in the main text.