Measuring Diversity in Synthetic Datasets
Authors: Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. ... We conduct experiments to verify the effectiveness of DCScore by examining correlation, computational cost, hyperparameter sensitivity, and further probing. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China 2School of Artificial Intelligence, Shenzhen University, Shenzhen, China 3School of Software Engineering, Sun Yat-sen University, Zhuhai, China 4Zhuhai Key Laboratory of Trusted Large Language Models, Zhuhai, China 5School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China 6Tencent AI Lab, Shenzhen, China 7Department of Computer Science, National University of Singapore, Singapore. |
| Pseudocode | No | The paper describes the DCScore method in Section 4.1 'DCScore: Measuring Diversity from a Classification Perspective' using textual explanations, mathematical formulations, and a figure, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/bluewhalelab/dcscore. |
| Open Datasets | Yes | Additionally, we utilize three publicly available existing datasets, including SST2 (Socher et al., 2013), Yelp (Zhang et al., 2015), and AG News (Zhang et al., 2015), and their AttrPrompt-augmented version (Yu et al., 2024). |
| Dataset Splits | Yes | In zero-shot or few-shot settings, we utilize the 70B model to generate three sub-datasets for the text classification task, corresponding to τg = {0.2, 0.7, 1.2}, respectively... Each sub-dataset contains 3,000 samples, and a context is employed to prompt the 70B model to generate five samples. To train text classification models on each sub-dataset, we randomly split 2,100 samples to the training set for each sub-dataset and gather the remaining 900 samples into the testing set across all three sub-datasets. Consequently, we construct a test set comprising 1,800 samples. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA Tesla V100 GPUs with 32GB of memory. |
| Software Dependencies | No | For three transformation-based methods, including DCScore, Vendi Score, and K-means Inertia, we employ unsup-simcse-bert-base-uncased (princeton-nlp, 2021) as the weight of the embedding function. For all language models used to generate the dataset, we set the top-p and top-k parameters to 1 and -1, respectively. ... To train text classification models, we employ RoBERTa (Liu, 2019) as the encoder ... We employ LoRA (Hu et al., 2021) to finetune the encoder and the classifier ... We use AdamW (Loshchilov, 2017) ... This section lists specific models and techniques but does not provide explicit version numbers for software libraries or frameworks like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | For DCScore... we set τ in Eq. (4) to 1 for all other experiments. ... We set the number of clusters to 10 for all experiments. ... we fix the batch size of generating sample representation at 128 across all experiments. ... we fix the LoRA scaling factor to 32 and the rank of the update matrices to 8. We use AdamW (Loshchilov, 2017) with an initial learning rate of 5e-5 and linear learning rate decay as our optimizer. Additionally, we set the batch size per GPU as 32 and epochs as 120. For all language models used to generate the dataset, we set the top-p and top-k parameters to 1 and -1, respectively. Additionally, we limit the maximum number of newly generated tokens to 100 for the text classification task and 30 for the story completion task. ... We vary τ within the range of {0.0001, 0.001, 0.1, 0.5, 1, 10}. |
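The split described in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: three sub-datasets (one per generation temperature τg) of 3,000 samples each, with 2,100 samples per sub-dataset going to training and the held-out remainder pooled into a shared test set. The `make_subdataset` helper is a placeholder for the actual LLM generation step. Note that pooling all 3×900 held-out samples yields 2,700; the quoted text reports a 1,800-sample test set, so the paper may subsample further — this sketch simply pools everything held out.

```python
import random

def make_subdataset(tau_g, n=3000):
    # Placeholder: stands in for the 3,000 samples the 70B model
    # generates at temperature tau_g for the text classification task.
    return [f"sample_tau{tau_g}_{i}" for i in range(n)]

def split_subdatasets(taus=(0.2, 0.7, 1.2), n_train=2100, seed=0):
    # Randomly assign 2,100 samples of each sub-dataset to its own
    # training set and pool the remaining samples into one test set.
    rng = random.Random(seed)
    train_sets, shared_test = {}, []
    for tau in taus:
        data = make_subdataset(tau)
        rng.shuffle(data)
        train_sets[tau] = data[:n_train]
        shared_test.extend(data[n_train:])
    return train_sets, shared_test

train_sets, test_set = split_subdatasets()
print(len(train_sets[0.2]), len(test_set))  # 2100 per training set; 2700 pooled here
```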