Improving Consistency in Large Language Models through Chain of Guidance

Authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar

TMLR 2025

Reproducibility assessment: each variable is listed with its result and the supporting LLM response (quoted from the paper).
Research Type: Experimental. "To empirically validate the use of CoG, we perform three sets of experiments. First, to measure the efficacy of CoG, we generate paraphrased question-answer pairs z_i = (x_i, y_i) from a number of LLMs with and without CoG, and measure the consistency of the answers. Second, we perform a number of LLM finetuning runs leveraging the datasets and methods in Section 3.3, and report consistency metrics of LLMs before and after fine-tuning. Third, to measure any effect on LLM performance metrics, we report the evaluation results of LLMs with and without fine-tuning based on Open LLM Leaderboard benchmarks."
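The first experiment measures how consistent an LLM's answers are across paraphrases of the same question. A minimal sketch of that evaluation idea, using pairwise exact-match agreement as a simplified stand-in for the consistency metrics the paper actually reports:

```python
from itertools import combinations

def consistency(answers):
    """Pairwise exact-match agreement among answers to paraphrases of one
    question. A simplified stand-in for the paper's consistency metrics,
    used only to illustrate the evaluation setup."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy example: answers an LLM might give to three paraphrases of one question.
print(consistency(["Paris", "Paris", "Lyon"]))  # 1 agreeing pair out of 3
```

A CoG-finetuned model would be expected to score closer to 1.0 than its base model on the same paraphrase sets.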
Researcher Affiliation: Collaboration. Harsh Raj (Northeastern University), Vipul Gupta (Pennsylvania State University), Domenic Rosati (Dalhousie University), Subhabrata Majumdar (Vijil).
Pseudocode: Yes. "Listing 1: The paraphrase prompt template for in-context paraphrasing... Listing 2: The rank prompt template for CoG... Listing 3: The answer prompt template"
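The three listings suggest a three-stage prompt chain. The sketch below shows one plausible wiring of those stages; the `llm` callable and all prompt wording are hypothetical placeholders, not the paper's actual templates:

```python
def chain_of_guidance(question, llm, n_paraphrases=3):
    """Sketch of a three-stage CoG-style prompt chain (cf. Listings 1-3).

    `llm` is a hypothetical callable mapping a prompt string to a completion;
    the prompt wording here is illustrative, not the paper's templates.
    """
    # Stage 1 ("paraphrase" template): generate paraphrases of the question.
    paraphrases = [
        llm(f"Paraphrase this question, variant {i + 1}: {question}")
        for i in range(n_paraphrases)
    ]
    # Stage 2 ("rank" template): have the model select the best paraphrase.
    best = llm("Pick the clearest paraphrase: " + " | ".join(paraphrases))
    # Stage 3 ("answer" template): answer the selected paraphrase.
    return llm(f"Answer concisely: {best}")
```

With a real model behind `llm`, the final answer (paired with the original question) would form one of the z_i = (x_i, y_i) pairs used for finetuning.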
Open Source Code: Yes. "Code is available at https://github.com/vijil-AI/chain_of_guidance."
Open Datasets: Yes. "TruthfulQA is a widely used dataset for benchmarking LLMs on truthfulness. It has associated metrics and baselines to evaluate freeform text generation (Lin et al., 2022). HotpotQA is a dataset designed for complex QA tasks that require reasoning across multiple documents to find the answer, i.e. multi-hop reasoning (Yang et al., 2018). CommonsenseQA is a QA dataset that requires models to engage in commonsense reasoning to answer the questions (Talmor et al., 2019). AmbigQA is a dataset with multiple closely related questions that may seem identical but are not really (Min et al., 2020). ... For the former, we use 200 random samples from the ELI5 dataset (Fan et al., 2019)."
Dataset Splits: Yes. "Small: Only TruthfulQA is used. CoG-generated question-answer pairs based on a 90% random sample of questions are used for finetuning; the rest is kept for validation. Large: This dataset is composed of the small dataset above plus question-answer pairs generated using 900 randomly chosen questions from HotpotQA, 900 questions from CommonsenseQA, and 1200 questions from AmbigQA. CoG-generated data obtained using the rest of the samples in the four Q&A datasets are kept for validation."
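The split recipe can be reconstructed mechanically. In this sketch the 90% / 900 / 900 / 1200 counts come from the description above, while dataset loading, seeding, and shuffling details are assumptions; each argument is simply a list of questions:

```python
import random

def make_splits(truthfulqa, hotpotqa, commonsenseqa, ambigqa, seed=0):
    """Reconstruct the 'small' and 'large' finetuning splits.
    The sample sizes follow the report; seeding and shuffling are assumed."""
    rng = random.Random(seed)

    def split(pool, n_train):
        pool = list(pool)
        rng.shuffle(pool)
        return pool[:n_train], pool[n_train:]

    # Small: 90% of TruthfulQA for finetuning, the rest for validation.
    tq_train, tq_val = split(truthfulqa, int(0.9 * len(truthfulqa)))
    # Large: the small training set plus fixed-size samples from three more datasets.
    hp_train, hp_val = split(hotpotqa, 900)
    cs_train, cs_val = split(commonsenseqa, 900)
    am_train, am_val = split(ambigqa, 1200)
    large_train = tq_train + hp_train + cs_train + am_train
    large_val = tq_val + hp_val + cs_val + am_val
    return (tq_train, tq_val), (large_train, large_val)
```

CoG question-answer generation would then be applied to each training question before finetuning.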
Hardware Specification: Yes. "All computations were performed on a cloud instance hosted on the RunPod platform, composed of a single A40 GPU with 48 GB of VRAM, 9 CPUs, and 50 GB RAM."
Software Dependencies: No. "We use these two datasets to fine-tune two LLMs, Llama 2 7B Chat and Llama 3 8B Instruct, applying LoRA and SFT using the open-source library axolotl." The paper names axolotl but does not provide a specific version number for it or for any other software dependency.
Experiment Setup: Yes. "We run each finetuning for 5 epochs with a learning rate of 1e-5."
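The reported recipe maps onto an axolotl-style YAML config. In the fragment below only `num_epochs` and `learning_rate` come from the paper; the base model name, LoRA rank/alpha, and dataset path are placeholders, and the field names follow axolotl's common schema without being verified against the (unspecified) version the authors used:

```yaml
# Illustrative axolotl-style config; only num_epochs and learning_rate
# are reported in the paper. All other values are placeholders.
base_model: meta-llama/Llama-2-7b-chat-hf
adapter: lora            # LoRA + SFT, per the paper
lora_r: 16               # rank not reported; placeholder
lora_alpha: 32           # placeholder
learning_rate: 1.0e-5    # reported
num_epochs: 5            # reported
datasets:
  - path: cog_small.jsonl   # hypothetical file of CoG-generated pairs
    type: alpaca
```

Pinning the axolotl version in such a config (or an accompanying lockfile) is exactly what the "Software Dependencies: No" finding above says is missing.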