Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, although these questions and their corresponding topics are unseen during training. Trained only on the ANAH train set, the score of Llama3.1-8B-Instruct on the ANAH test set improves from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset also improves from 30.29% to 39.39%.
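The quoted gains can be tallied directly. A quick sanity check of the arithmetic (variable names are illustrative; the numbers come from the quote above):

```python
# Scores quoted in the paper, in percentage points
anah_before, anah_after = 49.19, 77.53   # Llama3.1-8B-Instruct on ANAH test
bio_before, bio_after = 30.29, 39.39     # FactScore on Biography (out-of-domain)
llama70b_anah = 53.44                    # Llama3.1-70B-Instruct on ANAH test

anah_gain = round(anah_after - anah_before, 2)  # +28.34 points in-domain
bio_gain = round(bio_after - bio_before, 2)     # +9.1 points out-of-domain

print(anah_gain, bio_gain, anah_after > llama70b_anah)
```

This confirms the headline claims: a 28.34-point in-domain gain, a 9.1-point out-of-domain gain, and the aligned 8B model outscoring the 70B baseline on ANAH.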
Researcher Affiliation | Collaboration | Yuzhe Gu (1,2), Wenwei Zhang (2), Chengqi Lyu (2), Dahua Lin (2,3), Kai Chen (2); (1) Shanghai Jiao Tong University, (2) Shanghai AI Laboratory, (3) MMLab, The Chinese University of Hong Kong
Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code is available at https://github.com/open-compass/ANAH.
Open Datasets | Yes | For the in-domain data, we use a subset of the ANAH-v2 (Gu et al., 2024) data as the test set, containing 177 questions, and another 8046 questions as the training set. ... For the out-of-domain data, we use a subset of Biography (Min et al., 2023) as the test set, which has 183 questions about biography generation and has no overlap with our training set.
Dataset Splits | Yes | For the in-domain data, we use a subset of the ANAH-v2 (Gu et al., 2024) data as the test set, containing 177 questions, and another 8046 questions as the training set. ... For the out-of-domain data, we use a subset of Biography (Min et al., 2023) as the test set, which has 183 questions about biography generation and has no overlap with our training set.
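The split bookkeeping described above (8046 ANAH-v2 train questions, 177 ANAH-v2 test questions, 183 Biography test questions, no train/test overlap) can be expressed as a small consistency check. This is a hypothetical sketch with stand-in question IDs, not the authors' data-loading code:

```python
# Stand-in IDs for the three question pools described in the paper.
anah_train = {f"anah-{i}" for i in range(8046)}            # ANAH-v2 train split
anah_test = {f"anah-{i}" for i in range(8046, 8046 + 177)}  # ANAH-v2 test split
bio_test = {f"bio-{i}" for i in range(183)}                 # Biography test set

# Sizes match the counts quoted in the paper.
assert len(anah_train) == 8046 and len(anah_test) == 177 and len(bio_test) == 183

# In-domain train/test are disjoint, and the out-of-domain test set
# shares no questions with training.
assert anah_train.isdisjoint(anah_test)
assert anah_train.isdisjoint(bio_test)
print("split sizes and disjointness OK")
```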
Hardware Specification | Yes | Our model is trained on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The training and inference frameworks are XTuner (Contributors, 2023b) and LMDeploy (Contributors, 2023a) respectively. While these tools are named, specific version numbers (e.g., 1.0.0) are not provided, only the years of the corresponding citations.
Experiment Setup | Yes | We train the base model with the following settings and hyperparameters: 3 epochs, a learning rate of 5e-6, a batch size of 64, and the AdamW optimizer with a cosine annealing learning rate scheduler. ... The decoding strategy involves top-k (k = 40) sampling with a temperature of 0.8.
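The hyperparameters above can be collected into a config sketch, and the stated decoding strategy (top-k truncation with temperature scaling) is simple enough to illustrate in plain Python. This is a minimal hypothetical sketch: the key names are illustrative, not actual XTuner or LMDeploy config fields, and the sampler is a generic top-k implementation, not the authors' inference code.

```python
import math
import random

# Hyperparameters quoted in the paper (illustrative key names)
train_cfg = dict(epochs=3, lr=5e-6, batch_size=64,
                 optimizer="AdamW", lr_scheduler="cosine_annealing")
decode_cfg = dict(top_k=40, temperature=0.8)

def top_k_sample(logits, top_k=40, temperature=0.8, rng=random):
    """Sample a token index: keep the top_k highest logits, rescale by
    temperature, and draw from the resulting softmax distribution."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]

# Toy vocabulary of 100 tokens with increasing logits: any sampled token
# must come from the 40 highest-scoring indices (60..99 here).
logits = [0.1 * i for i in range(100)]
token = top_k_sample(logits, **decode_cfg)
assert token >= 60
```

A lower temperature sharpens the distribution toward the highest-logit tokens; the top-k cutoff guarantees that tokens outside the 40 most likely are never emitted regardless of temperature.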