FactGen: Faithful Text Generation by Factuality-aware Pre-training and Contrastive Ranking Fine-tuning
Authors: Zhibin Lan, Wei Li, Jinsong Su, Xinyan Xiao, Jiachen Liu, Wenhao Wu, Yajuan Lyu
JAIR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three conditional text generation tasks demonstrate the effectiveness and generality of our training framework. |
| Researcher Affiliation | Collaboration | Zhibin Lan (School of Informatics, Xiamen University, China; Shanghai Artificial Intelligence Laboratory, China); Wei Li (Baidu, Beijing, China); Jinsong Su, corresponding author (School of Informatics, Xiamen University, China; Shanghai Artificial Intelligence Laboratory, China); Xinyan Xiao, Jiachen Liu, Wenhao Wu, Yajuan Lyu (Baidu, Beijing, China) |
| Pseudocode | No | The paper describes the model architecture and training steps in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "We develop our model based on the open-source toolkit Transformers (https://github.com/huggingface/transformers)." This indicates the use of a third-party toolkit, not a release of the authors' own implementation of FactGen. |
| Open Datasets | Yes | We pre-train the model on the RealNews-like (Raffel et al., 2020) dataset (an approximately 35 GB corpus) and evaluate it on three text generation tasks: text summarization, table-to-text generation, and dialogue generation. For the text summarization task, we evaluate the performance of the model on XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015), two of the most commonly used datasets. For the table-to-text task, we follow Liu et al. (2021a) and conduct experiments on the WIKIPERSON (Wang et al., 2018) dataset. In the dialogue generation experiments, we use Dialogue NLI (Welleck et al., 2019) as our evaluation dataset. |
| Dataset Splits | Yes | Table 2 (statistics of evaluation datasets, train/valid/test): XSum — 204,045 / 11,332 / 11,334; CNN/DM — 287,227 / 13,368 / 11,490; WIKIPERSON — 250,186 / 30,487 / 29,982; Dialogue NLI — 310,110 / 16,500 / 12,376. |
| Hardware Specification | Yes | We conduct our experiments on a V100 GPU with 32 GB of memory. |
| Software Dependencies | No | We develop our model based on the open-source toolkit Transformers. The paper does not specify a version number for Transformers or any other software dependency. |
| Experiment Setup | Yes | Factuality-aware Pre-training: We use Adam as the optimizer with a linearly scheduled learning rate of 2e-5, a weight decay of 0.01, a maximum of 512 input tokens and 256 output tokens, and a batch size of 2048. We post-pretrain the full model for 20,000 steps with a warmup of 7,500 steps, starting from BART. Contrastive Ranking Fine-tuning: We use FactCC (Kryscinski et al., 2020) as the ranking metric for the text summarization and dialogue generation tasks, and PARENT (Dhingra et al., 2019) as the ranking metric for the table-to-text generation task. For all datasets, we use Adam as the optimizer with a polynomially scheduled learning rate of 3e-5, label smoothing of 0.1, 5 training epochs, a batch size of 64, and a warmup of 10,000 steps. All models are first fine-tuned on their respective datasets before fine-tuning with the contrastive ranking loss. Task-specific hyper-parameters are shown in Table 3. We use diverse beam search (Vijayakumar et al., 2018) to generate 16 candidates for each data sample and set γ in Equation 8 to 100 when computing the combined loss. |
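The quoted setup combines a standard generation loss with a contrastive ranking loss weighted by γ = 100, where 16 diverse-beam-search candidates are ranked by a factuality metric (FactCC or PARENT). The exact form of Equation 8 is not reproduced in this report, so the sketch below is a hypothetical reconstruction using a common pairwise margin formulation over metric-sorted candidates; the function names, margin value, and loss shape are assumptions, not the paper's verified implementation.

```python
# Hypothetical sketch: combined loss L = L_nll + gamma * L_rank (gamma = 100
# per the paper). The pairwise margin ranking loss below is a common choice
# for contrastive ranking fine-tuning; the paper's Equation 8 may differ.

def ranking_loss(model_scores, margin=0.01):
    """Pairwise margin loss over candidates pre-sorted by a factuality
    metric (best first): the model score of a better-ranked candidate
    should exceed that of a worse-ranked one by a rank-scaled margin."""
    loss, pairs = 0.0, 0
    for i in range(len(model_scores)):
        for j in range(i + 1, len(model_scores)):
            # margin grows with rank distance (an assumed convention)
            loss += max(0.0, model_scores[j] - model_scores[i] + margin * (j - i))
            pairs += 1
    return loss / max(pairs, 1)

def combined_loss(nll, model_scores, gamma=100.0):
    """Total loss: negative log-likelihood plus gamma-weighted ranking loss."""
    return nll + gamma * ranking_loss(model_scores)
```

With candidates already ordered correctly by the model (scores decreasing with rank, gaps larger than the margin), the ranking term vanishes and the combined loss reduces to the NLL alone; misordered candidates add a penalty scaled by γ.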