Exponential Scaling of Factual Inconsistency in Data-to-Text Generation with Fine-Tuned LLMs

Authors: Joy Mahapatra, Soumyajit Roy, Utpal Garain

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments are conducted across six diverse LLM families and five D2T datasets. Factual inconsistency is inversely measured using four state-of-the-art consistency metrics, including human evaluation. We employ QLo RA, Prefix-Tuning, and full fine-tuning to fine-tune the LLMs. Our analysis, validated through the Va CScal framework, consistently shows that factual inconsistency in D2T generation follows exponential scaling with respect to model (LLM) size, compute (FLOPs), and fine-tuning data size challenging the prevailing assumption of power law scaling.
Researcher Affiliation Academia Joy Mahapatra (EMAIL), Indian Statistical Institute, Kolkata; Soumyajit Roy (EMAIL), Indian Statistical Institute, Kolkata; Utpal Garain (EMAIL), Indian Statistical Institute, Kolkata
Pseudocode No The paper describes methods using prose and mathematical equations, e.g., the Power Law Scaling Model, f(x) = Ax^α + B if x > 0 and 0 otherwise (Eq. 1), and the Exponential Scaling Model, f(x) = Ce^(βx) + D if x > 0 and 0 otherwise (Eq. 2), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
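To make the two scaling models concrete, here is a minimal sketch of how they can be fit and compared with SciPy's curve_fit (the fitting library the paper reports using). The synthetic data below is invented purely for illustration and does not reproduce any result from the paper; parameter names follow Eqs. (1) and (2).

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, A, alpha, B):
    # Eq. (1): f(x) = A * x^alpha + B for x > 0
    return A * np.power(x, alpha) + B

def exponential(x, C, beta, D):
    # Eq. (2): f(x) = C * exp(beta * x) + D for x > 0
    return C * np.exp(beta * x) + D

# Synthetic inconsistency scores drawn from an exponential trend
# (hypothetical values, not the paper's measurements)
x = np.linspace(0.1, 5.0, 40)
y = 2.0 * np.exp(-0.8 * x) + 0.1

p_pow, _ = curve_fit(power_law, x, y, p0=(1.0, -0.5, 0.0), maxfev=10000)
p_exp, _ = curve_fit(exponential, x, y, p0=(1.0, -1.0, 0.0), maxfev=10000)

# Compare goodness of fit via the sum of squared errors
sse_pow = np.sum((power_law(x, *p_pow) - y) ** 2)
sse_exp = np.sum((exponential(x, *p_exp) - y) ** 2)
print(f"power law SSE: {sse_pow:.3e}, exponential SSE: {sse_exp:.3e}")
```

On data that genuinely follows an exponential trend, the exponential model attains a much lower fitting error than the power law, which mirrors the kind of model comparison the paper performs.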
Open Source Code Yes All code used in our scaling experiments is provided in the Supplementary Material.
Open Datasets Yes Our experiments incorporate five widely used D2T datasets, covering three major D2T generation types: DART (Nan et al., 2021) and WebNLG (Gardent et al., 2017) for graph-to-text, WikiTableText (Bao et al., 2018) for table-to-text, and E2E (Dušek et al., 2018) and ViGGO (Juraska et al., 2018) for MR-to-text. ... All datasets and model variants are sourced from the Hugging Face hub (Wolf et al., 2020).
Dataset Splits Yes Fine-tuning and evaluation are conducted separately on each D2T dataset, using their respective training and testing splits.
Hardware Specification Yes All experiments were conducted using a single NVIDIA A6000 GPU (48 GB) and a single NVIDIA A100 GPU (80 GB).
Software Dependencies No In all of our experiments, we use the SciPy library (Virtanen et al., 2020) to train both scaling models (power law and exponential) on factual inconsistency results. ... For all fine-tuning and model quantization tasks involving the LLM families, we make extensive use of the transformers library (Wolf et al., 2020) from Hugging Face. The paper mentions the SciPy and transformers libraries with citations, but does not provide specific version numbers for these software dependencies.
Experiment Setup Yes We use two parameter-efficient fine-tuning strategies, QLoRA (Dettmers et al., 2023) and Prefix-Tuning (Li & Liang, 2021), to fine-tune all LLMs on each D2T dataset. ... In the QLoRA setup, we use a reduced rank (r = 16), applied primarily to the attention and feedforward modules of the LLMs. For Prefix-Tuning, we use a virtual prefix of 32 tokens. ... we fix the learning rate at 1.00e-04 for trainable parameters. For full fine-tuning, we use a learning rate of 5.00e-05. For decoding, we consider both greedy decoding and nucleus sampling. For nucleus sampling, we set the nucleus size to 0.95.
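For context on the decoding setup, nucleus (top-p) sampling with p = 0.95 restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.95. The sketch below is an illustrative NumPy re-implementation of that filtering step, not the authors' code (which relies on the Hugging Face transformers generation pipeline); the example distribution is made up.

```python
import numpy as np

def nucleus_filter(probs, p=0.95):
    """Zero out tokens outside the smallest set whose cumulative
    probability reaches p, then renormalize the remainder."""
    order = np.argsort(probs)[::-1]        # token indices, descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # keep just enough tokens to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical 5-token vocabulary distribution
probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
out = nucleus_filter(probs, p=0.95)
print(out)
```

The lowest-probability tail is discarded and the surviving probabilities are renormalized, which is the behavior transformers' `top_p` generation argument implements at each decoding step.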