Putting People in LLMs’ Shoes: Generating Better Answers via Question Rewriter

Authors: Junhao Chen, Bowen Wang, Zhouqiang Jiang, Yuta Nakashima

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method.
Researcher Affiliation | Academia | Junhao Chen, Bowen Wang*, Zhouqiang Jiang, Yuta Nakashima; Osaka University, Japan; EMAIL, EMAIL
Pseudocode | No | The paper describes the method pipeline with a diagram (Figure 2) and in narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/3244we/Question-Rewriter
Open Datasets | Yes | We evaluate three distinct LFQA datasets, each equipped with automated evaluation criteria. K-QA (Manes et al. 2024), sourced from the medical domain... TruthfulQA (Lin, Hilton, and Evans 2022), covering multiple domains... OASST1QA, derived from the multi-turn dialogue alignment dataset OASST1, incorporates a criterion S_pref that measures human preference for answers using a pre-trained reward model.
Dataset Splits | Yes | The original LFQA dataset is divided into three parts: training, validation, and testing. R is trained on the training set (i.e., D) for one epoch, and we select the model that most prefers the rewritten question q′ to the original question q. Specifically, we define the preference score PS as PS = E[ 1[ P_R(q′ | t, q) > P_R(q | t, q) ] ] (Eq. 5), where 1[·] gives 1 if the given condition is satisfied and 0 otherwise; the expectation is computed over all q from the validation set and q′ ∼ P_R(q). Table 1 ("Statistics of LFQA datasets used to evaluate our method") gives the numbers of samples in the training, validation, and testing splits.
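The checkpoint-selection criterion in Eq. (5) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `score` is a hypothetical stand-in for the rewriter's log-probability P_R(candidate | t, q), and the toy scorer below simply prefers longer questions.

```python
# Sketch of the preference score PS from Eq. (5): the fraction of
# validation questions for which the model assigns higher probability
# to the rewritten question q' than to the original question q.

def preference_score(pairs, score):
    """pairs: list of (q, q_prime) tuples from the validation set.
    score(candidate, original): stand-in for log P_R(candidate | t, original).
    Returns PS = E[ 1[ score(q', q) > score(q, q) ] ]."""
    hits = sum(1 for q, q_prime in pairs if score(q_prime, q) > score(q, q))
    return hits / len(pairs)

# Toy demonstration with a dummy scorer (hypothetical) that prefers
# longer questions; a real run would query the trained rewriter R.
val_pairs = [
    ("why sky blue", "Why does the sky appear blue during the day?"),
    ("flu shot", "Should adults get an annual influenza vaccination?"),
]
dummy_score = lambda cand, orig: len(cand)
print(preference_score(val_pairs, dummy_score))  # 1.0: both rewrites preferred
```

The checkpoint with the highest PS on the validation set would then be kept.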
Hardware Specification | Yes | All our testing and training, except for the DPO training on OASST1QA, are conducted on a system equipped with four NVIDIA A100-40GB-PCIe GPUs. Because OASST1QA samples are long, we used only samples for which the question plus the rewriting prompt t and the rewritten question q′ totals at most 512 tokens, and conducted the DPO training on a system with four NVIDIA A100-80GB-PCIe GPUs.
Software Dependencies | No | The paper names the LLMs used (e.g., Llama3-8B-instruct, Mistral-7B-v0.2, GPT-3.5-turbo), but it does not specify versions of ancillary software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | During DPO training, we set the dropout rate to 0.8, the training batch size to 32, and the testing batch size to 64, keeping all other parameters at their defaults in the source code. For sampling rewritten questions, we use top-p sampling with cumulative probability p = 0.999 and temperature 1 for R0 to ensure diversity. We sample 100 unique rewritten questions for each original question and terminate sampling after 10,000 attempts. (N+, N−) defaults to (10, 20), (5, 10), and (4, 5) for K-QA, TQA, and OQA, respectively. The maximum token length is set to 512 during feedback collection and testing.
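The described sampling procedure (collect 100 unique rewrites per question, giving up after 10,000 attempts) can be sketched as follows. The `generate` callable is a hypothetical stand-in for one top-p (p = 0.999, temperature 1) sample from the initial rewriter R0; a real run would call the LLM instead of the toy generator used here.

```python
import random

def sample_unique_rewrites(generate, n_unique=100, max_attempts=10_000):
    """Collect up to n_unique distinct rewrites, stopping early if
    max_attempts samples have been drawn. Duplicates are discarded
    via a set, matching the "100 unique rewritten questions" setup."""
    seen = set()
    attempts = 0
    while len(seen) < n_unique and attempts < max_attempts:
        seen.add(generate())
        attempts += 1
    return list(seen)

# Toy generator: random paraphrase IDs standing in for LLM samples.
rng = random.Random(0)
rewrites = sample_unique_rewrites(lambda: f"rewrite-{rng.randrange(500)}")
print(len(rewrites))
```

The 10,000-attempt cap bounds the cost when the model keeps producing duplicates, e.g. for very short questions with few plausible rewrites.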