Improving Consistency in Large Language Models through Chain of Guidance

Authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar

TMLR 2025

Reproducibility assessment: each variable is listed with its result and the supporting LLM response (quoted from the paper).
Research Type: Experimental. "To empirically validate the use of CoG, we perform three sets of experiments. First, to measure the efficacy of CoG, we generate paraphrased question-answer pairs z_i = (x_i, y_i) from a number of LLMs with and without CoG, and measure the consistency of the answers. Second, we perform a number of LLM finetuning runs leveraging the datasets and methods in Section 3.3, and report consistency metrics of LLMs before and after fine-tuning. Third, to measure any effect on LLM performance metrics, we report the evaluation results of LLMs with and without fine-tuning based on Open LLM Leaderboard benchmarks."
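The first experiment measures how consistent an LLM's answers are across paraphrases of the same question. A minimal sketch of that evaluation idea, using pairwise exact-match agreement as a simplified stand-in for the consistency metrics the paper actually reports:

```python
from itertools import combinations

def consistency(answers):
    """Pairwise exact-match agreement among answers to paraphrases of one
    question. A simplified stand-in for the paper's consistency metrics,
    used only to illustrate the evaluation setup."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy example: answers an LLM might give to three paraphrases of one question.
print(consistency(["Paris", "Paris", "Lyon"]))  # 1 agreeing pair out of 3
```

A CoG-finetuned model would be expected to score closer to 1.0 than its base model on the same paraphrase sets.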
Researcher Affiliation: Collaboration. Harsh Raj (Northeastern University), Vipul Gupta (Pennsylvania State University), Domenic Rosati (Dalhousie University), Subhabrata Majumdar (Vijil).
Pseudocode: Yes. "Listing 1: The paraphrase prompt template for in-context paraphrasing... Listing 2: The rank prompt template for CoG... Listing 3: The answer prompt template"
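The three listings suggest a three-stage prompt chain. The sketch below shows one plausible wiring of those stages; the `llm` callable and all prompt wording are hypothetical placeholders, not the paper's actual templates:

```python
def chain_of_guidance(question, llm, n_paraphrases=3):
    """Sketch of a three-stage CoG-style prompt chain (cf. Listings 1-3).

    `llm` is a hypothetical callable mapping a prompt string to a completion;
    the prompt wording here is illustrative, not the paper's templates.
    """
    # Stage 1 ("paraphrase" template): generate paraphrases of the question.
    paraphrases = [
        llm(f"Paraphrase this question, variant {i + 1}: {question}")
        for i in range(n_paraphrases)
    ]
    # Stage 2 ("rank" template): have the model select the best paraphrase.
    best = llm("Pick the clearest paraphrase: " + " | ".join(paraphrases))
    # Stage 3 ("answer" template): answer the selected paraphrase.
    return llm(f"Answer concisely: {best}")
```

With a real model behind `llm`, the final answer (paired with the original question) would form one of the z_i = (x_i, y_i) pairs used for finetuning.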
Open Source Code: Yes. "Code is available at https://github.com/vijil-AI/chain_of_guidance."
Open Datasets: Yes. "TruthfulQA is a widely used dataset for benchmarking LLMs on truthfulness. It has associated metrics and baselines to evaluate freeform text generation (Lin et al., 2022). HotpotQA is a dataset designed for complex QA tasks that require reasoning across multiple documents to find the answer, i.e. multi-hop reasoning (Yang et al., 2018). CommonsenseQA is a QA dataset that requires models to engage in commonsense reasoning to answer the questions (Talmor et al., 2019). AmbigQA is a dataset with multiple closely related questions that may seem identical but are not really (Min et al., 2020). ... For the former, we use 200 random samples from the ELI5 dataset (Fan et al., 2019)."
Dataset Splits: Yes. "Small: Only TruthfulQA is used. CoG-generated question-answer pairs based on a 90% random sample of questions are used for finetuning; the rest is kept for validation. Large: This dataset is composed of the small dataset above plus question-answer pairs generated using 900 randomly chosen questions from HotpotQA, 900 questions from CommonsenseQA, and 1200 questions from AmbigQA. CoG-generated data obtained using the rest of the samples in the four Q&A datasets are kept for validation."
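The split recipe can be reconstructed mechanically. In this sketch the 90% / 900 / 900 / 1200 counts come from the description above, while dataset loading, seeding, and shuffling details are assumptions; each argument is simply a list of questions:

```python
import random

def make_splits(truthfulqa, hotpotqa, commonsenseqa, ambigqa, seed=0):
    """Reconstruct the 'small' and 'large' finetuning splits.
    The sample sizes follow the report; seeding and shuffling are assumed."""
    rng = random.Random(seed)

    def split(pool, n_train):
        pool = list(pool)
        rng.shuffle(pool)
        return pool[:n_train], pool[n_train:]

    # Small: 90% of TruthfulQA for finetuning, the rest for validation.
    tq_train, tq_val = split(truthfulqa, int(0.9 * len(truthfulqa)))
    # Large: the small training set plus fixed-size samples from three more datasets.
    hp_train, hp_val = split(hotpotqa, 900)
    cs_train, cs_val = split(commonsenseqa, 900)
    am_train, am_val = split(ambigqa, 1200)
    large_train = tq_train + hp_train + cs_train + am_train
    large_val = tq_val + hp_val + cs_val + am_val
    return (tq_train, tq_val), (large_train, large_val)
```

CoG question-answer generation would then be applied to each training question before finetuning.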
Hardware Specification: Yes. "All computations were performed on a cloud instance hosted on the RunPod platform, composed of a single A40 GPU with 48 GB of VRAM, 9 CPUs, and 50 GB RAM."
Software Dependencies: No. "We use these two datasets to fine-tune two LLMs, Llama 2 7B Chat and Llama 3 8B Instruct, applying LoRA and SFT using the open-source library axolotl." The paper names axolotl but does not provide a specific version number for it or for any other software dependency.
Experiment Setup: Yes. "We run each finetuning for 5 epochs with a learning rate of 1e-5."
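The reported recipe maps onto an axolotl-style YAML config. In the fragment below only `num_epochs` and `learning_rate` come from the paper; the base model name, LoRA rank/alpha, and dataset path are placeholders, and the field names follow axolotl's common schema without being verified against the (unspecified) version the authors used:

```yaml
# Illustrative axolotl-style config; only num_epochs and learning_rate
# are reported in the paper. All other values are placeholders.
base_model: meta-llama/Llama-2-7b-chat-hf
adapter: lora            # LoRA + SFT, per the paper
lora_r: 16               # rank not reported; placeholder
lora_alpha: 32           # placeholder
learning_rate: 1.0e-5    # reported
num_epochs: 5            # reported
datasets:
  - path: cog_small.jsonl   # hypothetical file of CoG-generated pairs
    type: alpaca
```

Pinning the axolotl version in such a config (or an accompanying lockfile) is exactly what the "Software Dependencies: No" finding above says is missing.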