Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions

Authors: Michael Zhang, W. Bradley Knox, Eunsol Choi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On open-domain QA datasets with multiple annotations, we evaluate systems based on their ability to ask clarifying questions to recover each user's interpretation and expected answer. We compare systems trained using our proposed preference labeling methods against standard methods, which assign preferences based on only prior context. Our method achieves a 5% improvement in F1 measured against the answer set from different interpretations of each query, showing the value of modeling future conversation turns.
Researcher Affiliation | Academia | New York University; The University of Texas at Austin
Pseudocode | No | The paper includes figures describing methods and interactions (e.g., Figure 1, Figure 2, Figure 3) but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release all code and data at https://github.com/mikejqzhang/clarifying_questions.
Open Datasets | Yes | We perform our experiments on the Natural Questions (NQ-Open) (Kwiatkowski et al., 2019; Lee et al., 2019) and AmbigQA (Min et al., 2020) datasets.
Dataset Splits | Yes | For each method of generating feasible answer sets (Y_human and Y_model), we generate an SFT dataset of 4,400 input query and clarifying question pairs (x, q), which we split into training (4,000) and development (400) splits. Between both datasets, this gives us a total of 19,807 (x, q, a_i, y_i) examples. In Appendix A, we include examples and the exact prompts. For RLHF training, we use examples from the NQ-Open training set after removing examples used to generate our SFT datasets from NQ-Open. This leaves us with 70,904 remaining input questions, which we split into training and development splits (64,584 and 6,320).
Hardware Specification | Yes | We perform all experiments on a single machine with 8 A40 (48GB) GPUs using the transformers library (Wolf et al., 2020) and the AdamW optimizer (all training runs completed within 24 hours).
Software Dependencies | No | We perform all experiments on a single machine with 8 A40 (48GB) GPUs using the transformers library (Wolf et al., 2020) and the AdamW optimizer (all training runs completed within 24 hours). While the transformers library is mentioned, no specific version number is provided for it or for any other software component used in the experiments.
Experiment Setup | Yes | During SFT training, we train all models with a learning rate of 5e-5 and batch size of 32. Training was performed for up to 5 epochs, evaluating on our development set after each epoch and selecting the best-performing checkpoint. For DPO training, we merge LoRA checkpoints from our SFT-only baseline and train using a KL regularization factor of 0.1 and a learning rate of 5e-6 in all experiments. Training was performed for up to 2 epochs, until loss converged on development data, selecting the best-performing checkpoint. For Llama2-7b based methods, we train with a batch size of 32, evaluating every 750 steps. For Gemma-7b based methods, we train with a batch size of 16, evaluating every 1500 steps.
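The F1 result quoted in the Research Type row compares a system's recovered answers against the answer set from each interpretation of a query. A minimal sketch of such a set-based F1 is below; the exact-string matching and the function name are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hedged sketch: set-based F1 between a predicted answer set and the
# reference answer set for one interpretation of an ambiguous query.
# Exact-string matching is a simplification for illustration.

def answer_set_f1(predicted, reference):
    """F1 over two answer sets (harmonic mean of precision and recall)."""
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# One correct answer out of two predictions: precision 0.5, recall 1.0,
# so F1 = 2 * 0.5 * 1.0 / 1.5 ≈ 0.667.
print(answer_set_f1({"1969", "July 1969"}, {"1969"}))
```

In a multi-annotation setting like AmbigQA, this score would typically be computed per interpretation and then aggregated across queries.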
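The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch. The dictionary layout and key names below are assumptions for illustration; only the numeric values come from the paper.

```python
# Hedged sketch: training hyperparameters as reported in the paper.
# Key names and structure are illustrative, not the authors' config format.

SFT_CONFIG = {
    "learning_rate": 5e-5,
    "batch_size": 32,
    "max_epochs": 5,  # best checkpoint selected on the dev set each epoch
}

DPO_CONFIG = {
    "learning_rate": 5e-6,
    "kl_beta": 0.1,   # KL regularization factor
    "max_epochs": 2,  # trained until dev loss converged
    # Per-base-model settings (batch size and eval interval differ):
    "llama2_7b": {"batch_size": 32, "eval_every_steps": 750},
    "gemma_7b": {"batch_size": 16, "eval_every_steps": 1500},
}

def summarize(name, cfg):
    """One-line summary of the top-level scalar hyperparameters."""
    scalars = {k: v for k, v in cfg.items() if not isinstance(v, dict)}
    return f"{name}: " + ", ".join(f"{k}={v}" for k, v in scalars.items())

print(summarize("SFT", SFT_CONFIG))
print(summarize("DPO", DPO_CONFIG))
```

Note that the paper reports LoRA-merged checkpoints as the starting point for DPO, so these settings apply to adapter training rather than full fine-tuning.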