Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, although these questions and their corresponding topics are unseen during training. Trained only on the ANAH train set, the score of Llama3.1-8B-Instruct on the ANAH test set improves from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset also improves from 30.29% to 39.39%.
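The quoted gains can be tallied directly. A quick sanity check of the arithmetic (variable names are illustrative; the numbers come from the quote above):

```python
# Scores quoted in the paper, in percentage points
anah_before, anah_after = 49.19, 77.53   # Llama3.1-8B-Instruct on ANAH test
bio_before, bio_after = 30.29, 39.39     # FactScore on Biography (out-of-domain)
llama70b_anah = 53.44                    # Llama3.1-70B-Instruct on ANAH test

anah_gain = round(anah_after - anah_before, 2)  # +28.34 points in-domain
bio_gain = round(bio_after - bio_before, 2)     # +9.1 points out-of-domain

print(anah_gain, bio_gain, anah_after > llama70b_anah)
```

This confirms the headline claims: a 28.34-point in-domain gain, a 9.1-point out-of-domain gain, and the aligned 8B model outscoring the 70B baseline on ANAH.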
Researcher Affiliation | Collaboration | Yuzhe Gu (1,2), Wenwei Zhang (2), Chengqi Lyu (2), Dahua Lin (2,3), Kai Chen (2); (1) Shanghai Jiao Tong University, (2) Shanghai AI Laboratory, (3) MMLab, The Chinese University of Hong Kong
Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code is available at https://github.com/open-compass/ANAH.
Open Datasets | Yes | For the in-domain data, we use a subset of the ANAH-v2 (Gu et al., 2024) data as the test set, containing 177 questions, and another 8046 questions as the training set. ... For the out-of-domain data, we use a subset of Biography (Min et al., 2023) as the test set, which has 183 questions about biography generation and has no overlap with our training set.
Dataset Splits | Yes | For the in-domain data, we use a subset of the ANAH-v2 (Gu et al., 2024) data as the test set, containing 177 questions, and another 8046 questions as the training set. ... For the out-of-domain data, we use a subset of Biography (Min et al., 2023) as the test set, which has 183 questions about biography generation and has no overlap with our training set.
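The split bookkeeping described above (8046 ANAH-v2 train questions, 177 ANAH-v2 test questions, 183 Biography test questions, no train/test overlap) can be expressed as a small consistency check. This is a hypothetical sketch with stand-in question IDs, not the authors' data-loading code:

```python
# Stand-in IDs for the three question pools described in the paper.
anah_train = {f"anah-{i}" for i in range(8046)}            # ANAH-v2 train split
anah_test = {f"anah-{i}" for i in range(8046, 8046 + 177)}  # ANAH-v2 test split
bio_test = {f"bio-{i}" for i in range(183)}                 # Biography test set

# Sizes match the counts quoted in the paper.
assert len(anah_train) == 8046 and len(anah_test) == 177 and len(bio_test) == 183

# In-domain train/test are disjoint, and the out-of-domain test set
# shares no questions with training.
assert anah_train.isdisjoint(anah_test)
assert anah_train.isdisjoint(bio_test)
print("split sizes and disjointness OK")
```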
Hardware Specification | Yes | Our model is trained on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The training and inference frameworks are XTuner (Contributors, 2023b) and LMDeploy (Contributors, 2023a) respectively. While these tools are named, specific version numbers (e.g., 1.0.0) are not provided, only the years of the corresponding citations.
Experiment Setup | Yes | We train the base model with the following settings and hyperparameters: 3 epochs, a learning rate of 5e-6, a batch size of 64, and the AdamW optimizer with a cosine annealing learning rate scheduler. ... The decoding strategy involves top-k (k = 40) sampling with a temperature of 0.8.
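The hyperparameters above can be collected into a config sketch, and the stated decoding strategy (top-k truncation with temperature scaling) is simple enough to illustrate in plain Python. This is a minimal hypothetical sketch: the key names are illustrative, not actual XTuner or LMDeploy config fields, and the sampler is a generic top-k implementation, not the authors' inference code.

```python
import math
import random

# Hyperparameters quoted in the paper (illustrative key names)
train_cfg = dict(epochs=3, lr=5e-6, batch_size=64,
                 optimizer="AdamW", lr_scheduler="cosine_annealing")
decode_cfg = dict(top_k=40, temperature=0.8)

def top_k_sample(logits, top_k=40, temperature=0.8, rng=random):
    """Sample a token index: keep the top_k highest logits, rescale by
    temperature, and draw from the resulting softmax distribution."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]

# Toy vocabulary of 100 tokens with increasing logits: any sampled token
# must come from the 40 highest-scoring indices (60..99 here).
logits = [0.1 * i for i in range(100)]
token = top_k_sample(logits, **decode_cfg)
assert token >= 60
```

A lower temperature sharpens the distribution toward the highest-logit tokens; the top-k cutoff guarantees that tokens outside the 40 most likely are never emitted regardless of temperature.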