In-context Learning Demonstration Generation with Text Distillation
Authors: Wuyuqing Wang, Erkun Yang, Zilan Zhou, Cheng Deng
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted across ten prevalent text datasets demonstrate that our DDG method substantially outperforms existing state-of-the-art methodologies. Our code will be available at https://github.com/wwyq1/DDG. |
| Researcher Affiliation | Academia | Wuyuqing Wang, Erkun Yang, Zilan Zhou, and Cheng Deng, Xidian University, Xi'an, China |
| Pseudocode | Yes | Algorithm 1 The training process of DDG |
| Open Source Code | Yes | Our code will be available at https://github.com/wwyq1/DDG. |
| Open Datasets | Yes | We assess the efficacy of DDG on ten widely utilized text classification datasets, including eight short-text datasets: SST-2, SST-5, MNLI, QQP, CoLA, AGNews, QNLI, and CR; in addition to two complex long-text multi-tag datasets: BANKING77 and GoEmotions. |
| Dataset Splits | No | The paper describes how samples are used for In-Context Learning (ICL) in few-shot settings (e.g., "5-shot format", "extracted 50 samples from the test set as query"). However, it does not explicitly provide the overall training, validation, and test splits for the ten datasets used. |
| Hardware Specification | No | The paper mentions specific LLMs like LLaMA-2-7B, Qwen-1.5-7B, Mistral-7B, and Long-LLaMA, and pre-trained models like RoBERTa-large, but does not specify the hardware (e.g., GPU/CPU models, memory) used to run or train these models or the DDG framework. |
| Software Dependencies | No | The paper mentions using the "GPT-3 model is selected as the basis of the generative model Gϕ" and "RoBERTa-large model is chosen as the pre-trained model for the calculative models", as well as the "AdamW optimizer", but it does not specify version numbers for any of these software components or other libraries. |
| Experiment Setup | Yes | In this paper, the GPT-3 model is selected as the basis of the generative model Gϕ, while the RoBERTa-large model is chosen as the pre-trained model for the calculative models. We set the parameters of the nested-loop algorithm for training the optimal generative model as follows: the total number of training sessions for the initial training with language-modeling loss functions [Kaplan et al., 2020] is 50,000, and the total number of training sessions for fine-tuning the parameters of the generative model Gϕ is 10,000. The number of inner-loop steps is set to IL = 50, and the number of outer-loop steps is calculated as OL = total number of training sessions / number of inner-loop steps. The learning rate is set to 1.0 × 10⁻⁴, and the number of updating steps ε is set to 100. The mini-batch sizes for original and synthesized samples are set to N = 200 and M = 50. The warmup ratio for the entire process is set to 0.05, weight decay to 0.01, gradient clipping to 1.0, and the dropout ratio to 0.1. Finally, the generative model Gϕ was set to synthesize five samples simultaneously in each iteration, and each sample was generated with strict reference to the trainer's setting. Moreover, we trained the calculative model parameters θ1 and θ2 separately on the original datasets and generated demonstrations five times in a loop with a learning rate of 1.0 × 10⁻³; the number of updating steps τ is set to 10. Simultaneously, a λ parameter of 0.99 was chosen to update the improver model Vθ3 under the teacher-student framework. For short-text datasets such as SST-2, we empirically set k=58, p=0.97, temperature to 0.7, and the repetition-penalty value to 1.2; for long-text multi-tag datasets such as BANKING77, it is more appropriate to set k=10-12, p=0.95, temperature to 0.9, and the repetition-penalty value to 1.35. |
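The experiment-setup excerpt above can be collected into a configuration sketch. This is a minimal illustration only: the function and variable names (`outer_loop_steps`, `TRAIN_CONFIG`, `SAMPLING`) are hypothetical and not taken from the paper's released code at https://github.com/wwyq1/DDG, which may organize these hyperparameters differently.

```python
# Hypothetical restatement of the hyperparameters quoted from the paper.
# Names here are illustrative, not the authors' actual code.

def outer_loop_steps(total_sessions: int, inner_loop: int = 50) -> int:
    """OL = total number of training sessions / inner-loop steps (IL = 50)."""
    return total_sessions // inner_loop

# Training hyperparameters for the generative model G_phi.
TRAIN_CONFIG = {
    "initial_sessions": 50_000,   # language-modeling pre-training
    "finetune_sessions": 10_000,  # fine-tuning G_phi
    "lr_generative": 1.0e-4,      # learning rate for G_phi
    "updating_steps_eps": 100,    # epsilon updating steps
    "batch_original_N": 200,      # mini-batch of original samples
    "batch_synth_M": 50,          # mini-batch of synthesized samples
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "grad_clip": 1.0,
    "dropout": 0.1,
    "samples_per_iteration": 5,
    # Calculative models theta_1 / theta_2 and improver V_theta3:
    "lr_calculative": 1.0e-3,
    "updating_steps_tau": 10,
    "teacher_student_lambda": 0.99,
}

# Decoding parameters reported for the two dataset families.
SAMPLING = {
    "short_text": {  # e.g., SST-2
        "top_k": 58, "top_p": 0.97, "temperature": 0.7, "repetition_penalty": 1.2,
    },
    "long_text": {   # e.g., BANKING77; the paper gives k = 10-12, 12 used here
        "top_k": 12, "top_p": 0.95, "temperature": 0.9, "repetition_penalty": 1.35,
    },
}

print(outer_loop_steps(TRAIN_CONFIG["initial_sessions"]))   # OL for initial training
print(outer_loop_steps(TRAIN_CONFIG["finetune_sessions"]))  # OL for fine-tuning
```

With IL = 50, the stated formula gives OL = 1000 outer-loop steps for the 50,000-session initial training and OL = 200 for the 10,000-session fine-tuning phase.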