Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Authors: Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We empirically assessed the effectiveness of our methods on multiple text classification tasks, and the results showed that leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions to effectively leveraging synthetic data from any suitable data generator.
Researcher Affiliation | Academia | 1 Academia Sinica, 2 National Taiwan University, 3 Swiss Federal Institute of Technology in Lausanne (EPFL). EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 outlines how we use IMP-Loss. ... Algorithm 2 outlines how DIMP-Loss is used in training a model.
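The weighting idea behind losses such as IMP-Loss can be illustrated with a generic per-example weighted cross-entropy; this is a minimal stdlib sketch of that general technique, not the paper's exact weight formula (which comes from its quality/diversity checkers).

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Per-example weighted cross-entropy.

    probs   -- list of predicted class-probability vectors, one per example
    labels  -- gold class index per example
    weights -- per-example weight (standard CE is the all-ones case)
    """
    total = sum(w * -math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / len(labels)

# Down-weighting a noisy synthetic example reduces its pull on the loss.
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 0]  # the second example is poorly predicted (e.g. noisy label)
plain = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, [1.0, 0.1])
```

With the noisy example down-weighted, the averaged loss drops, so gradient updates are dominated by the clean example.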
Open Source Code | Yes | Code: https://github.com/Hsun-Yu/DIMP-Loss (footnote on page 1) and To ensure reproducibility, we provided the source code and generated dataset in supplementary materials...
Open Datasets | Yes | We assessed our proposed methods by comparing them with standard loss functions across several text classification benchmarks, including Financial Phrasebank (Financial) (Malo et al., 2014), irony detection (Tweet Irony) (Van Hee et al., 2018), and the MRPC dataset from GLUE (Wang et al., 2018).
Dataset Splits | Yes | In our experiments, we referred the large real-world data DP to the original training set from each benchmark and the small real-world data DQ to the original development set, with sizes from approximately 200 to 400, as shown in Table 5. ... For the Financial Phrasebank, ... we randomly divided it into training (70%), validation (5%), and testing (25%) sets like the previous work (Li et al., 2023). Table 5: Data size of each split: Train 3392, Dev 242, Test 1212 for Financial.
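The 70/5/25 split quoted above can be sketched with a seeded shuffle; this is illustrative only (the authors' exact shuffling procedure is not specified), but the resulting sizes match Table 5 for the Financial Phrasebank.

```python
import random

def split_70_5_25(examples, seed=42):
    """Shuffle and cut into train (70%), dev (5%), test (25%)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.70)
    n_dev = int(n * 0.05)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Financial Phrasebank has 4846 sentences in total (3392 + 242 + 1212).
train, dev, test = split_70_5_25(range(4846))
```

The integer truncation reproduces the reported sizes: 3392 train, 242 dev, 1212 test.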
Hardware Specification | Yes | All experiments were conducted using PyTorch (Paszke et al., 2019) and Huggingface (for models and datasets) on V100 GPUs with 32GB memory.
Software Dependencies | No | All experiments were conducted using PyTorch (Paszke et al., 2019) and Huggingface (for models and datasets). For our experiments, we used a pre-trained BERT-base model (Devlin et al., 2019) from Huggingface's transformers library (Wolf et al., 2020). (The text names the software with citations but does not give specific version numbers for PyTorch or the Huggingface libraries used.)
Experiment Setup | Yes | We fine-tuned the model with hyperparameters selected from the following ranges: learning rate {6e-6, 6e-5}, epochs {5, 7}, and batch size {32, 64}. Other hyperparameters were set to the default values provided by Huggingface's trainer for text classification. The best checkpoint was selected based on the accuracy of the development set. We repeated each experiment with four random seeds. ... The downstream models and checkers were trained for 5 epochs with a batch size of 32. The batch size was set to 64 during the precalculating constant weights phase. The inner loop epochs were set to 1 for SunGen.
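The hyperparameter search described above is a small 2×2×2 grid selected by dev-set accuracy; a minimal stdlib sketch follows. The actual fine-tuning call (e.g. a Huggingface Trainer run) is abstracted into a `run_trial` callback, and the scoring lambda below is a purely hypothetical stand-in.

```python
from itertools import product

# Grid quoted from the experiment setup.
grid = {
    "learning_rate": [6e-6, 6e-5],
    "epochs": [5, 7],
    "batch_size": [32, 64],
}

def best_config(run_trial):
    """Try every combination; keep the one with the highest dev accuracy."""
    keys = list(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = run_trial(cfg)  # fine-tune + evaluate on the dev set
        if best is None or acc > best[1]:
            best = (cfg, acc)
    return best

# Hypothetical scorer: pretends the smaller learning rate works best.
cfg, acc = best_config(lambda c: 0.80 + (c["learning_rate"] == 6e-6) * 0.05)
```

In the paper's setting, each grid point would additionally be repeated over four random seeds before comparing dev accuracies.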