Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning
Authors: Yujian Liu, Shiyu Chang, Tommi Jaakkola, Yang Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that PREREQ-TUNE outperforms existing baselines in improving LLMs' factuality across short QA and long-form generation tasks. It also opens new possibilities for knowledge-controlled generation in LLMs. |
| Researcher Affiliation | Collaboration | Yujian Liu (UC Santa Barbara); Shiyu Chang (UC Santa Barbara); Tommi Jaakkola (MIT CSAIL); Yang Zhang (MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes the methodology with detailed steps and mathematical formulations (e.g., equations 1, 2, 3, 4) and a high-level diagram in Figure 1, but it does not include a distinct pseudocode block or algorithm section with structured, code-like formatting. |
| Open Source Code | Yes | Our code is available at https://github.com/UCSB-NLP-Chang/Prereq_tune.git. |
| Open Datasets | Yes | For QA, we evaluate on PopQA (Mallen et al., 2023) and HotpotQA (Yang et al., 2018). PopQA contains factoid questions about 16 relations... HotpotQA contains questions that require multiple reasoning steps... Additionally, for biography generation, we use the 183 labeled persons in Min et al. (2023) as test set to keep consistent with prior works (Lin et al., 2024). |
| Dataset Splits | Yes | Table 4: Number of examples in the real downstream task dataset DT. Training, Validation, Test counts are provided for Persons, Medical Entities, PopQA, and HotpotQA. For PopQA, we randomly split PopQA data into training, validation, and test to ensure no overlapping subjects, and we use the original split for HotpotQA. |
| Hardware Specification | Yes | All experiments are conducted on 8 80GB NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'We base our implementations on alignment-handbook.2' which points to a specific project. However, it does not provide specific version numbers for this handbook or any other key software components such as Python, PyTorch, or CUDA, which are required for replication. |
| Experiment Setup | Yes | We search training steps, learning rate, and LoRA rank on the validation set for all methods. Table 6 lists the hyperparameters we search. During inference, we use greedy decoding for all methods. We set β = 0.1, learning rate as 1e-6, and train for 500 steps following Lin et al. (2024). |
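The subject-disjoint split described in the Dataset Splits row (all PopQA questions about the same subject land in exactly one of train/validation/test) can be sketched as below. This is a minimal illustration, not the authors' code; the `"subject"` field name and the 80/10/10 ratios are assumptions for the example.

```python
import random

def split_by_subject(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/validation/test with no subject overlap.

    `examples` is a list of dicts with a (hypothetical) "subject" key;
    every question about a given subject is assigned to the same split,
    so knowledge about test subjects never leaks into training.
    """
    # Shuffle the unique subjects, not the individual questions.
    subjects = sorted({ex["subject"] for ex in examples})
    random.Random(seed).shuffle(subjects)

    n = len(subjects)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train_subj = set(subjects[:n_train])
    val_subj = set(subjects[n_train:n_train + n_val])

    splits = {"train": [], "validation": [], "test": []}
    for ex in examples:
        if ex["subject"] in train_subj:
            splits["train"].append(ex)
        elif ex["subject"] in val_subj:
            splits["validation"].append(ex)
        else:
            splits["test"].append(ex)
    return splits
```

Splitting at the subject level (rather than the question level) is what guarantees the "no overlapping subjects" property the paper reports for PopQA.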