Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning
Authors: Yujian Liu, Shiyu Chang, Tommi Jaakkola, Yang Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that PREREQ-TUNE outperforms existing baselines in improving LLMs' factuality across short QA and long-form generation tasks. It also opens new possibilities for knowledge-controlled generation in LLMs. |
| Researcher Affiliation | Collaboration | Yujian Liu (UC Santa Barbara); Shiyu Chang (UC Santa Barbara); Tommi Jaakkola (MIT CSAIL); Yang Zhang (MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes the methodology with detailed steps and mathematical formulations (e.g., equations 1, 2, 3, 4) and a high-level diagram in Figure 1, but it does not include a distinct pseudocode block or algorithm section with structured, code-like formatting. |
| Open Source Code | Yes | Our code is available at https://github.com/UCSB-NLP-Chang/Prereq_tune.git. |
| Open Datasets | Yes | For QA, we evaluate on PopQA (Mallen et al., 2023) and HotpotQA (Yang et al., 2018). PopQA contains factoid questions about 16 relations... HotpotQA contains questions that require multiple reasoning steps... Additionally, for biography generation, we use the 183 labeled persons in Min et al. (2023) as test set to keep consistent with prior works (Lin et al., 2024). |
| Dataset Splits | Yes | Table 4: Number of examples in the real downstream task dataset DT. Training, Validation, Test counts are provided for Persons, Medical Entities, PopQA, and HotpotQA. For PopQA, we randomly split PopQA data into training, validation, and test to ensure no overlapping subjects, and we use the original split for HotpotQA. |
| Hardware Specification | Yes | All experiments are conducted on 8 80GB NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'We base our implementations on alignment-handbook.2' which points to a specific project. However, it does not provide specific version numbers for this handbook or any other key software components such as Python, PyTorch, or CUDA, which are required for replication. |
| Experiment Setup | Yes | We search training steps, learning rate, and LoRA rank on the validation set for all methods. Table 6 lists the hyperparameters we search. During inference, we use greedy decoding for all methods. We set β = 0.1, learning rate as 1e-6, and train for 500 steps following Lin et al. (2024). |
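The subject-disjoint split described in the Dataset Splits row (all PopQA questions about the same subject land in exactly one of train/validation/test) can be sketched as below. This is a minimal illustration, not the authors' code; the `"subject"` field name and the 80/10/10 ratios are assumptions for the example.

```python
import random

def split_by_subject(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/validation/test with no subject overlap.

    `examples` is a list of dicts with a (hypothetical) "subject" key;
    every question about a given subject is assigned to the same split,
    so knowledge about test subjects never leaks into training.
    """
    # Shuffle the unique subjects, not the individual questions.
    subjects = sorted({ex["subject"] for ex in examples})
    random.Random(seed).shuffle(subjects)

    n = len(subjects)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train_subj = set(subjects[:n_train])
    val_subj = set(subjects[n_train:n_train + n_val])

    splits = {"train": [], "validation": [], "test": []}
    for ex in examples:
        if ex["subject"] in train_subj:
            splits["train"].append(ex)
        elif ex["subject"] in val_subj:
            splits["validation"].append(ex)
        else:
            splits["test"].append(ex)
    return splits
```

Splitting at the subject level (rather than the question level) is what guarantees the "no overlapping subjects" property the paper reports for PopQA.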