Putting People in LLMs’ Shoes: Generating Better Answers via Question Rewriter

Authors: Junhao Chen, Bowen Wang, Zhouqiang Jiang, Yuta Nakashima

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method.
Researcher Affiliation | Academia | Junhao Chen, Bowen Wang*, Zhouqiang Jiang, Yuta Nakashima; Osaka University, Japan; EMAIL, EMAIL
Pseudocode | No | The paper describes the method pipeline with a diagram (Figure 2) and in narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/3244we/Question-Rewriter
Open Datasets | Yes | We evaluate three distinct LFQA datasets, each equipped with automated evaluation criteria. K-QA (Manes et al. 2024), sourced from the medical domain... TruthfulQA (Lin, Hilton, and Evans 2022), covering multiple domains... OASST1QA, derived from the multi-turn dialogue alignment dataset OASST1, incorporates a criterion S_pref that measures human preference for answers using a pre-trained reward model.
Dataset Splits | Yes | The original LFQA dataset is divided into three parts: training, validation, and testing. R is trained on the training set (i.e., D) for one epoch, and we select the model that most prefers the rewritten question q′ to the original question q. Specifically, we define the preference score PS as PS = E[ 1[ P_R(q′ | t, q) > P_R(q | t, q) ] ] (Eq. 5), where 1[·] gives 1 if the given condition is satisfied and 0 otherwise; the expectation is computed over all q from the validation set and q′ ∼ P_R(q). Table 1 ("Statistics of LFQA datasets used to evaluate our method") gives the numbers of samples in the training, validation, and testing splits.
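The checkpoint-selection criterion in Eq. (5) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `score` is a hypothetical stand-in for the rewriter's log-probability P_R(candidate | t, q), and the toy scorer below simply prefers longer questions.

```python
# Sketch of the preference score PS from Eq. (5): the fraction of
# validation questions for which the model assigns higher probability
# to the rewritten question q' than to the original question q.

def preference_score(pairs, score):
    """pairs: list of (q, q_prime) tuples from the validation set.
    score(candidate, original): stand-in for log P_R(candidate | t, original).
    Returns PS = E[ 1[ score(q', q) > score(q, q) ] ]."""
    hits = sum(1 for q, q_prime in pairs if score(q_prime, q) > score(q, q))
    return hits / len(pairs)

# Toy demonstration with a dummy scorer (hypothetical) that prefers
# longer questions; a real run would query the trained rewriter R.
val_pairs = [
    ("why sky blue", "Why does the sky appear blue during the day?"),
    ("flu shot", "Should adults get an annual influenza vaccination?"),
]
dummy_score = lambda cand, orig: len(cand)
print(preference_score(val_pairs, dummy_score))  # 1.0: both rewrites preferred
```

The checkpoint with the highest PS on the validation set would then be kept.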
Hardware Specification | Yes | All our testing and training, except for the DPO training on OASST1QA, are conducted on a system equipped with four NVIDIA A100-40GB-PCIe GPUs. Because OASST1QA samples are long, we used only samples for which the question plus the rewriting prompt t and the rewritten question q′ totals at most 512 tokens, and conducted the DPO training on a system with four NVIDIA A100-80GB-PCIe GPUs.
Software Dependencies | No | The paper names the LLMs used (e.g., Llama3-8B-instruct, Mistral-7B-v0.2, GPT-3.5-turbo), but it does not specify versions of ancillary software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | During DPO training, we set the dropout rate to 0.8, the training batch size to 32, and the testing batch size to 64, keeping all other parameters at their defaults in the source code. For sampling rewritten questions, we use top-p sampling with cumulative probability p = 0.999 and temperature 1 for R0 to ensure diversity. We sample 100 unique rewritten questions for each original question and terminate sampling after 10,000 attempts. (N+, N−) defaults to (10, 20), (5, 10), and (4, 5) for K-QA, TQA, and OQA, respectively. The maximum token length is set to 512 during feedback collection and testing.
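The described sampling procedure (collect 100 unique rewrites per question, giving up after 10,000 attempts) can be sketched as follows. The `generate` callable is a hypothetical stand-in for one top-p (p = 0.999, temperature 1) sample from the initial rewriter R0; a real run would call the LLM instead of the toy generator used here.

```python
import random

def sample_unique_rewrites(generate, n_unique=100, max_attempts=10_000):
    """Collect up to n_unique distinct rewrites, stopping early if
    max_attempts samples have been drawn. Duplicates are discarded
    via a set, matching the "100 unique rewritten questions" setup."""
    seen = set()
    attempts = 0
    while len(seen) < n_unique and attempts < max_attempts:
        seen.add(generate())
        attempts += 1
    return list(seen)

# Toy generator: random paraphrase IDs standing in for LLM samples.
rng = random.Random(0)
rewrites = sample_unique_rewrites(lambda: f"rewrite-{rng.randrange(500)}")
print(len(rewrites))
```

The 10,000-attempt cap bounds the cost when the model keeps producing duplicates, e.g. for very short questions with few plausible rewrites.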