Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning

Authors: Yujian Liu, Shiyu Chang, Tommi Jaakkola, Yang Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that PREREQ-TUNE outperforms existing baselines in improving LLMs' factuality across short QA and long-form generation tasks. It also opens new possibilities for knowledge-controlled generation in LLMs.
Researcher Affiliation | Collaboration | Yujian Liu (UC Santa Barbara), Shiyu Chang (UC Santa Barbara), Tommi Jaakkola (MIT CSAIL), Yang Zhang (MIT-IBM Watson AI Lab)
Pseudocode | No | The paper describes the methodology with detailed steps, mathematical formulations (Equations 1–4), and a high-level diagram (Figure 1), but it does not include a distinct pseudocode block or algorithm section with structured, code-like formatting.
Open Source Code | Yes | Our code is available at https://github.com/UCSB-NLP-Chang/Prereq_tune.git.
Open Datasets | Yes | For QA, we evaluate on PopQA (Mallen et al., 2023) and HotpotQA (Yang et al., 2018). PopQA contains factoid questions about 16 relations... HotpotQA contains questions that require multiple reasoning steps... Additionally, for biography generation, we use the 183 labeled persons in Min et al. (2023) as the test set to keep consistent with prior works (Lin et al., 2024).
Dataset Splits | Yes | Table 4: Number of examples in the real downstream task dataset D_T. Training, validation, and test counts are provided for Persons, Medical Entities, PopQA, and HotpotQA. For PopQA, we randomly split the data into training, validation, and test to ensure no overlapping subjects, and we use the original split for HotpotQA.
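The subject-disjoint split described above could be sketched as follows. This is a minimal illustration, assuming each PopQA example carries a "subject" field; the helper name `split_by_subject` and the 80/10/10 ratios are hypothetical — the paper states only that the splits share no subjects, not how they were produced.

```python
import random
from collections import defaultdict

def split_by_subject(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/val/test so that no subject entity
    appears in more than one split (hypothetical sketch)."""
    # Group examples by their subject entity.
    by_subject = defaultdict(list)
    for ex in examples:
        by_subject[ex["subject"]].append(ex)
    # Shuffle subjects, not examples, so each subject lands in one split.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n = len(subjects)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    groups = (subjects[:n_train],
              subjects[n_train:n_train + n_val],
              subjects[n_train + n_val:])
    return tuple([ex for s in group for ex in by_subject[s]]
                 for group in groups)
```

Splitting at the subject level (rather than the example level) is what guarantees that no entity seen in training can leak into validation or test.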
Hardware Specification | Yes | All experiments are conducted on 8 80GB NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions "We base our implementations on alignment-handbook," which points to a specific project. However, it does not provide version numbers for this handbook or for other key software components such as Python, PyTorch, or CUDA, which are required for replication.
Experiment Setup | Yes | We search training steps, learning rate, and LoRA rank on the validation set for all methods. Table 6 lists the hyperparameters we search. During inference, we use greedy decoding for all methods. We set β = 0.1, the learning rate to 1e-6, and train for 500 steps, following Lin et al. (2024).
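The validation-set search described above could be sketched as a simple grid search. This is a hedged illustration only: the grid values below are placeholders (the paper's actual search space is in its Table 6), and `train_and_eval` stands in for whatever routine trains a model with a given config and returns a validation metric.

```python
from itertools import product

# Hypothetical search grid; the paper searches training steps,
# learning rate, and LoRA rank, but the exact candidate values
# live in its Table 6 and are not reproduced here.
GRID = {
    "train_steps": [250, 500, 1000],
    "learning_rate": [1e-6, 5e-6, 1e-5],
    "lora_rank": [8, 16, 32],
}

def grid_search(train_and_eval):
    """Return the config with the best validation score.

    `train_and_eval(config) -> float` is assumed to train with the
    given config and return a validation metric (higher is better).
    """
    best_cfg, best_score = None, float("-inf")
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        score = train_and_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Selecting hyperparameters on a held-out validation set, as the paper does, keeps the test sets untouched until final evaluation.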