Putting People in LLMs’ Shoes: Generating Better Answers via Question Rewriter
Authors: Junhao Chen, Bowen Wang, Zhouqiang Jiang, Yuta Nakashima
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. |
| Researcher Affiliation | Academia | Junhao Chen, Bowen Wang*, Zhouqiang Jiang, Yuta Nakashima Osaka University, Japan EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method pipeline with a diagram (Figure 2) and in narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/3244we/Question-Rewriter |
| Open Datasets | Yes | We evaluate three distinct LFQA datasets, each equipped with automated evaluation criteria. K-QA (Manes et al. 2024), sourced from the medical domain... TruthfulQA (Lin, Hilton, and Evans 2022), covering multiple domains... OASST1QA, derived from the multi-turn dialogue alignment dataset OASST1, incorporates a criterion S_pref that measures human preference for answers using a pre-trained reward model. |
| Dataset Splits | Yes | The original LFQA dataset is divided into three parts: training, validation, and testing. R is trained on the training set (i.e., D) for one epoch, and we select the best model that most prefers q′ to q. Specifically, we define the preference score PS as PS = E[1[P_R(q′\|t, q) > P_R(q\|t, q)]] (5), where 1[·] gives 1 if the given condition is satisfied and 0 otherwise; the expectation is computed over all q from the validation set and q′ ∼ P_R(·\|t, q). Table 1: Statistics of LFQA datasets used to evaluate our method. Columns for Training, Validation, and Testing give the numbers of samples in the respective dataset splits. |
| Hardware Specification | Yes | All our testing and training, except for the DPO training of OASST1QA, are conducted on a system equipped with four NVIDIA A100-40GB-PCIe GPUs. Due to the extensive length of OASST1QA samples, we only used samples whose question, plus the prompt t and rewritten question q′ for question rewriting, is at most 512 tokens, and conducted the DPO training on a system with four NVIDIA A100-80GB-PCIe GPUs. |
| Software Dependencies | No | The paper mentions various LLM models used (e.g., Llama3-8B-instruct, Mistral-7B-v0.2, GPT-3.5-turbo), but it does not specify versions for ancillary software libraries or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | During DPO training, we set the dropout rate to 0.8, the training batch size to 32, and the testing batch size to 64, maintaining all other parameters at their default settings in the source code. For sampling rewritten questions, we use top-p sampling with the cumulative probability set to 0.999 and the temperature of R0 set to 1, to ensure diversity. We sample 100 unique rewritten questions for each original question and terminate sampling after 10,000 attempts. N+ and N− default to (10, 20), (5, 10), and (4, 5) in K-QA, TQA, and OQA, respectively. The maximum token length is set to 512 during feedback collection and testing. |
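The preference score from Eq. (5) above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `logprob` is a hypothetical scoring function standing in for the rewriter's probability P_R(·|t, q), and the validation set is assumed to provide, for each original question q, the candidate rewrites q′ sampled from R.

```python
# Sketch of PS = E[ 1[ P_R(q'|t,q) > P_R(q|t,q) ] ]  (Eq. 5):
# the fraction of sampled rewrites q' that the rewriter scores
# higher than the original question q, averaged over validation data.

def preference_score(validation_set, logprob):
    """validation_set: iterable of (t, q, rewrites) triples, where
    `rewrites` are candidate q' sampled from R given prompt t and q.
    `logprob(t, q, x)`: hypothetical log P_R(x | t, q) under the model."""
    hits, total = 0, 0
    for t, q, rewrites in validation_set:
        p_orig = logprob(t, q, q)                    # P_R(q | t, q)
        for q_rw in rewrites:
            hits += logprob(t, q, q_rw) > p_orig     # indicator 1[...]
            total += 1
    return hits / total if total else 0.0

# Toy check with hard-coded scores in place of a real rewriter:
scores = {"q1": -2.0, "q1_rw_a": -1.0, "q1_rw_b": -5.0}
val = [("t", "q1", ["q1_rw_a", "q1_rw_b"])]
ps = preference_score(val, lambda t, q, x: scores[x])
# one of two rewrites beats the original, so ps == 0.5
```

In the paper's pipeline, this score is what selects the "best model" checkpoint on the validation split after the single DPO training epoch.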