Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Authors: Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We empirically assessed the effectiveness of our methods on multiple text classification tasks, and the results showed that leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions to effectively leveraging synthetic data from any suitable data generator.
Researcher Affiliation | Academia | 1 Academia Sinica, 2 National Taiwan University, 3 Swiss Federal Institute of Technology in Lausanne (EPFL). EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 outlines how we use IMP-Loss. ... Algorithm 2 outlines how DIMP-Loss is used in training a model.
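The weighting idea behind losses such as IMP-Loss can be illustrated with a generic per-example weighted cross-entropy; this is a minimal stdlib sketch of that general technique, not the paper's exact weight formula (which comes from its quality/diversity checkers).

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Per-example weighted cross-entropy.

    probs   -- list of predicted class-probability vectors, one per example
    labels  -- gold class index per example
    weights -- per-example weight (standard CE is the all-ones case)
    """
    total = sum(w * -math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / len(labels)

# Down-weighting a noisy synthetic example reduces its pull on the loss.
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 0]  # the second example is poorly predicted (e.g. noisy label)
plain = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, [1.0, 0.1])
```

With the noisy example down-weighted, the averaged loss drops, so gradient updates are dominated by the clean example.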
Open Source Code | Yes | Code: https://github.com/Hsun-Yu/DIMP-Loss (footnote on page 1) and To ensure reproducibility, we provided the source code and generated dataset in supplementary materials...
Open Datasets | Yes | We assessed our proposed methods by comparing them with standard loss functions across several text classification benchmarks, including Financial Phrasebank (Financial) (Malo et al., 2014), irony detection (Tweet Irony) (Van Hee et al., 2018), and the MRPC dataset from GLUE (Wang et al., 2018).
Dataset Splits | Yes | In our experiments, we referred the large real-world data DP to the original training set from each benchmark and the small real-world data DQ to the original development set, with sizes from approximately 200 to 400, as shown in Table 5. ... For the Financial Phrasebank, ... we randomly divided it into training (70%), validation (5%), and testing (25%) sets like the previous work (Li et al., 2023). Table 5: Data size of each split: Train 3392, Dev 242, Test 1212 for Financial.
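The 70/5/25 split quoted above can be sketched with a seeded shuffle; this is illustrative only (the authors' exact shuffling procedure is not specified), but the resulting sizes match Table 5 for the Financial Phrasebank.

```python
import random

def split_70_5_25(examples, seed=42):
    """Shuffle and cut into train (70%), dev (5%), test (25%)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.70)
    n_dev = int(n * 0.05)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Financial Phrasebank has 4846 sentences in total (3392 + 242 + 1212).
train, dev, test = split_70_5_25(range(4846))
```

The integer truncation reproduces the reported sizes: 3392 train, 242 dev, 1212 test.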
Hardware Specification | Yes | All experiments were conducted using PyTorch (Paszke et al., 2019) and Huggingface (for models and datasets) on V100 GPUs with 32GB memory.
Software Dependencies | No | All experiments were conducted using PyTorch (Paszke et al., 2019) and Huggingface (for models and datasets). For our experiments, we used a pre-trained BERT-base model (Devlin et al., 2019) from Huggingface's transformers library (Wolf et al., 2020). (The text names the software with citations but does not give specific version numbers for PyTorch or the Huggingface libraries used.)
Experiment Setup | Yes | We fine-tuned the model with hyperparameters selected from the following ranges: learning rate {6e-6, 6e-5}, epochs {5, 7}, and batch size {32, 64}. Other hyperparameters were set to the default values provided by Huggingface's trainer for text classification. The best checkpoint was selected based on the accuracy of the development set. We repeated each experiment with four random seeds. ... The downstream models and checkers were trained for 5 epochs with a batch size of 32. The batch size was set to 64 during the precalculating constant weights phase. The inner loop epochs were set to 1 for SunGen.
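The hyperparameter search described above is a small 2×2×2 grid selected by dev-set accuracy; a minimal stdlib sketch follows. The actual fine-tuning call (e.g. a Huggingface Trainer run) is abstracted into a `run_trial` callback, and the scoring lambda below is a purely hypothetical stand-in.

```python
from itertools import product

# Grid quoted from the experiment setup.
grid = {
    "learning_rate": [6e-6, 6e-5],
    "epochs": [5, 7],
    "batch_size": [32, 64],
}

def best_config(run_trial):
    """Try every combination; keep the one with the highest dev accuracy."""
    keys = list(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = run_trial(cfg)  # fine-tune + evaluate on the dev set
        if best is None or acc > best[1]:
            best = (cfg, acc)
    return best

# Hypothetical scorer: pretends the smaller learning rate works best.
cfg, acc = best_config(lambda c: 0.80 + (c["learning_rate"] == 6e-6) * 0.05)
```

In the paper's setting, each grid point would additionally be repeated over four random seeds before comparing dev accuracies.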