Exponential Scaling of Factual Inconsistency in Data-to-Text Generation with Fine-Tuned LLMs

Authors: Joy Mahapatra, Soumyajit Roy, Utpal Garain

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments are conducted across six diverse LLM families and five D2T datasets. Factual inconsistency is inversely measured using four state-of-the-art consistency metrics, including human evaluation. We employ QLo RA, Prefix-Tuning, and full fine-tuning to fine-tune the LLMs. Our analysis, validated through the Va CScal framework, consistently shows that factual inconsistency in D2T generation follows exponential scaling with respect to model (LLM) size, compute (FLOPs), and fine-tuning data size challenging the prevailing assumption of power law scaling.
Researcher Affiliation Academia Joy Mahapatra (EMAIL), Indian Statistical Institute, Kolkata; Soumyajit Roy (EMAIL), Indian Statistical Institute, Kolkata; Utpal Garain (EMAIL), Indian Statistical Institute, Kolkata
Pseudocode No The paper describes methods using prose and mathematical equations, e.g., the Power Law Scaling Model, f(x) = Ax^α + B if x > 0 and 0 otherwise (Eq. 1), and the Exponential Scaling Model, f(x) = Ce^(βx) + D if x > 0 and 0 otherwise (Eq. 2), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
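To make the two scaling models concrete, here is a minimal sketch of how they can be fit and compared with SciPy's curve_fit (the fitting library the paper reports using). The synthetic data below is invented purely for illustration and does not reproduce any result from the paper; parameter names follow Eqs. (1) and (2).

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, A, alpha, B):
    # Eq. (1): f(x) = A * x^alpha + B for x > 0
    return A * np.power(x, alpha) + B

def exponential(x, C, beta, D):
    # Eq. (2): f(x) = C * exp(beta * x) + D for x > 0
    return C * np.exp(beta * x) + D

# Synthetic inconsistency scores drawn from an exponential trend
# (hypothetical values, not the paper's measurements)
x = np.linspace(0.1, 5.0, 40)
y = 2.0 * np.exp(-0.8 * x) + 0.1

p_pow, _ = curve_fit(power_law, x, y, p0=(1.0, -0.5, 0.0), maxfev=10000)
p_exp, _ = curve_fit(exponential, x, y, p0=(1.0, -1.0, 0.0), maxfev=10000)

# Compare goodness of fit via the sum of squared errors
sse_pow = np.sum((power_law(x, *p_pow) - y) ** 2)
sse_exp = np.sum((exponential(x, *p_exp) - y) ** 2)
print(f"power law SSE: {sse_pow:.3e}, exponential SSE: {sse_exp:.3e}")
```

On data that genuinely follows an exponential trend, the exponential model attains a much lower fitting error than the power law, which mirrors the kind of model comparison the paper performs.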
Open Source Code Yes All code used in our scaling experiments is provided in the Supplementary Material.
Open Datasets Yes Our experiments incorporate five widely used D2T datasets, covering three major D2T generation types: DART (Nan et al., 2021) and WebNLG (Gardent et al., 2017) for graph-to-text, WikiTableText (Bao et al., 2018) for table-to-text, and E2E (Dušek et al., 2018) and ViGGO (Juraska et al., 2018) for MR-to-text. ... All datasets and model variants are sourced from the Hugging Face hub (Wolf et al., 2020).
Dataset Splits Yes Fine-tuning and evaluation are conducted separately on each D2T dataset, using their respective training and testing splits.
Hardware Specification Yes All experiments were conducted using a single NVIDIA A6000 GPU (48 GB) and a single NVIDIA A100 GPU (80 GB).
Software Dependencies No In all of our experiments, we use the SciPy library (Virtanen et al., 2020) to train both scaling models (power law and exponential) on factual inconsistency results. ... For all fine-tuning and model quantization tasks involving the LLM families, we make extensive use of the transformers library (Wolf et al., 2020) from Hugging Face. The paper mentions the SciPy and transformers libraries with citations, but does not provide specific version numbers for these software dependencies.
Experiment Setup Yes We use two parameter-efficient fine-tuning strategies, QLoRA (Dettmers et al., 2023) and Prefix-Tuning (Li & Liang, 2021), to fine-tune all LLMs on each D2T dataset. ... In the QLoRA setup, we use a reduced rank (r = 16), applied primarily to the attention and feedforward modules of the LLMs. For Prefix-Tuning, we use a virtual prefix of 32 tokens. ... we fix the learning rate at 1.00e-04 for trainable parameters. For full fine-tuning, we use a learning rate of 5.00e-05. For decoding, we consider both greedy decoding and nucleus sampling. For nucleus sampling, we set the nucleus size to 0.95.
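For context on the decoding setup, nucleus (top-p) sampling with p = 0.95 restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.95. The sketch below is an illustrative NumPy re-implementation of that filtering step, not the authors' code (which relies on the Hugging Face transformers generation pipeline); the example distribution is made up.

```python
import numpy as np

def nucleus_filter(probs, p=0.95):
    """Zero out tokens outside the smallest set whose cumulative
    probability reaches p, then renormalize the remainder."""
    order = np.argsort(probs)[::-1]        # token indices, descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # keep just enough tokens to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical 5-token vocabulary distribution
probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
out = nucleus_filter(probs, p=0.95)
print(out)
```

The lowest-probability tail is discarded and the surviving probabilities are renormalized, which is the behavior transformers' `top_p` generation argument implements at each decoding step.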