Enhancing Question Generation through Diversity-Seeking Reinforcement Learning with Bilevel Policy Decomposition

Authors: Tianyu Ren, Hui Wang, Karen Rafferty

AAAI 2025

Reproducibility assessment (variable — result, followed by the supporting LLM response):

Research Type — Experimental
  "Our integrated approach, named BPD-DSRL, demonstrates superior performance over existing baselines on multiple question quality and diversity metrics across various QG benchmarks. ... Table 2 presents the comparative results on three widely-used QG benchmarks. ... We conduct a series of ablation studies on our BPD framework and DSRL objective."

Researcher Affiliation — Academia
  "Tianyu Ren, Hui Wang*, Karen Rafferty. School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, United Kingdom. EMAIL"

Pseudocode — No
  "The pseudo-code for BPD-DSRL training and further technical specifics are provided in the supplementary material."

Open Source Code — Yes
  "Code and other supplementary material: https://github.com/Tianyu-Ren/BPD-DSRL"

Open Datasets — Yes
  "Following previous work (Gou et al. 2023; Narayan et al. 2022; Wang et al. 2020), we conduct experiments on two QG datasets: SQuAD 1.1 (Rajpurkar et al. 2016) and NewsQA (Trischler et al. 2017)."

Dataset Splits — Yes
  "Table 1: Statistics of the selected benchmarks. SQuAD 1.1/1 and SQuAD 1.1/2 are two different splits of SQuAD 1.1 from (Zhou et al. 2017) and (Du, Shao, and Cardie 2017)."

Hardware Specification — No
  "The implementation of them is detailed in the supplementary material." (The main text does not specify the hardware used for the experiments.)

Software Dependencies — No
  "All of our QG models and outcome reward models start from the pre-trained checkpoints of T5-large (Raffel et al. 2020). ... To assess hallucination, we employ spaCy to extract named entities ... For precision and cost-effectiveness, we utilize GPT-3.5 (Turbo-0125) in a zero-shot setting as the QA model." (No specific version numbers for software libraries are mentioned in the main text.)

Experiment Setup — No
  "We use consistent hyperparameter configurations across all three datasets during training (SFT warm-up and RL) and inference. The implementation of them is detailed in the supplementary material." (Specific hyperparameter values are not given in the main text.)
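The Software Dependencies row notes that the paper uses spaCy named-entity extraction to assess hallucination in generated questions. The paper does not describe the exact check, but a plausible sketch is: flag a generated question if it mentions a named entity that never appears in the source passage. Entity extraction itself would use spaCy (e.g. the `.text` of each span in `nlp(passage).ents`); the comparison step over already-extracted entity strings, with a hypothetical function name, might look like:

```python
def hallucinated_entities(context_ents, question_ents):
    """Return entities mentioned in the question but absent from the context.

    Both arguments are iterables of entity strings, e.g. the .text of
    spaCy spans taken from nlp(text).ents. Matching is case-insensitive;
    a non-empty result suggests the question hallucinates an entity.
    Note: this is an illustrative sketch, not the paper's actual metric.
    """
    context_lower = {e.lower() for e in context_ents}
    return [e for e in question_ents if e.lower() not in context_lower]


if __name__ == "__main__":
    # Entities extracted from a source passage vs. a generated question.
    ctx = ["Queen's University Belfast", "Tianyu Ren"]
    q = ["Tianyu Ren", "Oxford"]
    print(hallucinated_entities(ctx, q))  # -> ['Oxford']
```

String-level, case-insensitive matching is a deliberate simplification; a production check would likely also normalize entity variants (e.g. abbreviations) before comparing.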