In-context Learning Demonstration Generation with Text Distillation
Authors: Wuyuqing Wang, Erkun Yang, Zilan Zhou, Cheng Deng
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted across ten prevalent text datasets demonstrate that our DDG method substantially outperforms existing state-of-the-art methodologies. Our code will be available at https://github.com/wwyq1/DDG. |
| Researcher Affiliation | Academia | Wuyuqing Wang, Erkun Yang, Zilan Zhou, and Cheng Deng, Xidian University, Xi'an, China |
| Pseudocode | Yes | Algorithm 1 The training process of DDG |
| Open Source Code | Yes | Our code will be available at https://github.com/wwyq1/DDG. |
| Open Datasets | Yes | We assess the efficacy of DDG on ten widely utilized text classification datasets, including eight short-text datasets: SST-2, SST-5, MNLI, QQP, CoLA, AGNews, QNLI, and CR; in addition to two complex long-text multi-tag datasets: BANKING77 and GoEmotions. |
| Dataset Splits | No | The paper describes how samples are used for In-Context Learning (ICL) in few-shot settings (e.g., "5-shot format", "extracted 50 samples from the test set as query"). However, it does not explicitly provide the overall training, validation, and test splits for the ten datasets used. |
| Hardware Specification | No | The paper mentions specific LLMs like LLaMA-2-7B, Qwen-1.5-7B, Mistral-7B, and Long-LLaMA, and pre-trained models like RoBERTa-large, but does not specify the hardware (e.g., GPU/CPU models, memory) used to run or train these models or the DDG framework. |
| Software Dependencies | No | The paper mentions using the "GPT-3 model is selected as the basis of the generative model Gϕ" and "RoBERTa-large model is chosen as the pre-trained model for the calculative models", as well as the "AdamW optimizer", but it does not specify version numbers for any of these software components or other libraries. |
| Experiment Setup | Yes | In this paper, the GPT-3 model is selected as the basis of the generative model Gϕ, while the RoBERTa-large model is chosen as the pre-trained model for the calculative models. We set the parameters of the nested-loop algorithm for training the optimal generative model as follows: the total number of training sessions for the initial training with language-modeling loss functions [Kaplan et al., 2020] is 50,000, and the total number of training sessions for fine-tuning the parameters of the generative model Gϕ is 10,000. The number of inner-loop steps is set to IL = 50, and the number of outer-loop steps is calculated as OL = total number of training sessions / number of inner-loop steps. The learning rate is set to 1.0 × 10⁻⁴, and the number of updating steps ε is set to 100. The mini-batch sizes for original and synthesized samples are set to N = 200 and M = 50. The warmup ratio for the entire process is set to 0.05, weight decay to 0.01, gradient clipping to 1.0, and the dropout ratio to 0.1. Finally, the generative model Gϕ was set to synthesize five samples simultaneously in each iteration, and each sample was generated with strict reference to the trainer's setting. Moreover, we trained the calculative model parameters θ1 and θ2 separately on the original datasets and generated demonstrations five times in a loop with a learning rate of 1.0 × 10⁻³; the number of updating steps τ is set to 10. Simultaneously, a λ parameter of 0.99 was chosen to update the improver model Vθ3 under the teacher-student framework. For short-text datasets such as SST-2, we empirically set k=58, p=0.97, temperature to 0.7, and the repetition-penalty value to 1.2; for long-text multi-tag datasets such as BANKING77, it is more appropriate to set k=10-12, p=0.95, temperature to 0.9, and the repetition-penalty value to 1.35. |
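The experiment-setup excerpt above can be collected into a configuration sketch. This is a minimal illustration only: the function and variable names (`outer_loop_steps`, `TRAIN_CONFIG`, `SAMPLING`) are hypothetical and not taken from the paper's released code at https://github.com/wwyq1/DDG, which may organize these hyperparameters differently.

```python
# Hypothetical restatement of the hyperparameters quoted from the paper.
# Names here are illustrative, not the authors' actual code.

def outer_loop_steps(total_sessions: int, inner_loop: int = 50) -> int:
    """OL = total number of training sessions / inner-loop steps (IL = 50)."""
    return total_sessions // inner_loop

# Training hyperparameters for the generative model G_phi.
TRAIN_CONFIG = {
    "initial_sessions": 50_000,   # language-modeling pre-training
    "finetune_sessions": 10_000,  # fine-tuning G_phi
    "lr_generative": 1.0e-4,      # learning rate for G_phi
    "updating_steps_eps": 100,    # epsilon updating steps
    "batch_original_N": 200,      # mini-batch of original samples
    "batch_synth_M": 50,          # mini-batch of synthesized samples
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "grad_clip": 1.0,
    "dropout": 0.1,
    "samples_per_iteration": 5,
    # Calculative models theta_1 / theta_2 and improver V_theta3:
    "lr_calculative": 1.0e-3,
    "updating_steps_tau": 10,
    "teacher_student_lambda": 0.99,
}

# Decoding parameters reported for the two dataset families.
SAMPLING = {
    "short_text": {  # e.g., SST-2
        "top_k": 58, "top_p": 0.97, "temperature": 0.7, "repetition_penalty": 1.2,
    },
    "long_text": {   # e.g., BANKING77; the paper gives k = 10-12, 12 used here
        "top_k": 12, "top_p": 0.95, "temperature": 0.9, "repetition_penalty": 1.35,
    },
}

print(outer_loop_steps(TRAIN_CONFIG["initial_sessions"]))   # OL for initial training
print(outer_loop_steps(TRAIN_CONFIG["finetune_sessions"]))  # OL for fine-tuning
```

With IL = 50, the stated formula gives OL = 1000 outer-loop steps for the 50,000-session initial training and OL = 200 for the 10,000-session fine-tuning phase.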