Instruct Where the Model Fails: Generative Data Augmentation via Guided Self-contrastive Fine-tuning

Authors: Weijian Ma, Ruoxin Chen, Keyue Zhang, Shuang Wu, Shouhong Ding

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results on few-shot class incremental learning (FSCIL) demonstrate that our instruction-guided finetuning approach consistently enhances the downstream model's classification accuracy throughout the continual learning process, surpassing generative data augmentation methods such as Stable Diffusion and GPT-4o, as well as state-of-the-art non-generative FSCIL strategies. Experiment Settings: Dataset and Evaluation Metrics. We conduct our experiments under the settings of (Tao et al. 2020) and (Park, Song, and Park 2024) for fair comparison. The method is evaluated against state-of-the-art methods on the following datasets: miniImageNet (Ravi and Larochelle 2017), CUB200 (Wah et al. 2011), and CIFAR-100 (Krizhevsky 2009). Ablation Study: We perform two ablation studies on the miniImageNet and CUB200 datasets to justify our design of finetuning at both the semantic level and in detail.
Researcher Affiliation Collaboration Weijian Ma1*, Ruoxin Chen2, Keyue Zhang2, Shuang Wu2, Shouhong Ding2 1School of Computer Science, Fudan University 2Youtu Lab, Tencent
Pseudocode No The paper describes the method conceptually with diagrams (Figure 1 and 2) and prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any explicit statements about releasing the source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets Yes Dataset and Evaluation Metrics. We conduct our experiments under the settings of (Tao et al. 2020) and (Park, Song, and Park 2024) for fair comparison. The method is evaluated against state-of-the-art methods on the following datasets: miniImageNet (Ravi and Larochelle 2017), CUB200 (Wah et al. 2011), and CIFAR-100 (Krizhevsky 2009).
Dataset Splits Yes The split configurations for all datasets are shown in Table 2, which follow the prevailing settings of (Tao et al. 2020) and (Park, Song, and Park 2024). ... Table 2: Configuration settings for FSCIL benchmarks on CUB200, CIFAR-100, and miniImageNet. CUB200: base 100, incremental 10-way 5-shot, 1+10 sessions; CIFAR-100: base 60, incremental 5-way 5-shot, 1+8 sessions; miniImageNet: base 60, incremental 5-way 5-shot, 1+8 sessions.
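The split configuration quoted from Table 2 can be sanity-checked with a short sketch: in FSCIL, the base session plus the incremental sessions must cover every class of the dataset exactly once. The dictionary below transcribes the quoted settings; the variable names and structure are illustrative, not from the authors' code.

```python
# FSCIL benchmark splits as quoted from Table 2 of the paper.
# Keys and structure are illustrative, not taken from the authors' code.
fscil_splits = {
    "CUB200":       {"total_classes": 200, "base_classes": 100, "way": 10, "shot": 5, "inc_sessions": 10},
    "CIFAR-100":    {"total_classes": 100, "base_classes": 60,  "way": 5,  "shot": 5, "inc_sessions": 8},
    "miniImageNet": {"total_classes": 100, "base_classes": 60,  "way": 5,  "shot": 5, "inc_sessions": 8},
}

for name, cfg in fscil_splits.items():
    # Each incremental session introduces `way` new classes; together with
    # the base session they must account for every class in the dataset.
    covered = cfg["base_classes"] + cfg["way"] * cfg["inc_sessions"]
    assert covered == cfg["total_classes"], name
    print(f"{name}: 1 base + {cfg['inc_sessions']} incremental sessions, "
          f"{cfg['way']}-way {cfg['shot']}-shot")
```

The check also explains the "1+10" and "1+8" session counts in the table: one base session followed by 10 (CUB200) or 8 (CIFAR-100, miniImageNet) incremental sessions.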
Hardware Specification Yes The method is trained on 8 H100-80G GPUs.
Software Dependencies No The paper mentions specific models like 'Stable Diffusion v1.5', 'GPT4o', and 'VIT-B/16', but does not provide specific version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed for replication.
Experiment Setup Yes Implementation Details. We use Stable Diffusion v1.5 as our diffusion augmentor with a CFG guidance scale of 2, following the configuration of (Sarıyıldız et al. 2023). The total number of diffusion steps is set to 20. The VLM we use is GPT-4o in the main experiment. The initial prompt for SD 1.5 remains fixed as "A picture of a [category]". For downstream models, we use a ViT-B/16 (Dosovitskiy et al. 2021) pretrained on ImageNet-21K (Deng et al. 2009) for ours and the comparative methods. The learning rate of the downstream model is set to 2e-4, using the Adam optimizer with a cosine annealing learning rate scheduler. In each epoch, the training set is augmented with generated images matching the original set in size. For both the base session and the incremental sessions, we train the network until convergence.
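The quoted hyperparameters can be sketched as a minimal setup for reproduction attempts. This is an illustration under stated assumptions, not the authors' implementation: the linear layer stands in for the ViT-B/16 backbone, and the diffusion-side settings are collected in a config dict whose keys are ours.

```python
import torch
from torch import nn

# Downstream-model optimization as described in the paper: Adam with
# lr 2e-4 and a cosine-annealing schedule. The linear layer below is a
# placeholder for the ViT-B/16 backbone (this is a sketch only).
model = nn.Linear(768, 100)  # stand-in for a ViT-B/16 classification head
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Diffusion-augmentor settings quoted from the paper, gathered in an
# illustrative config dict (key names are ours, not the authors').
sd_config = {
    "model": "Stable Diffusion v1.5",
    "guidance_scale": 2,          # CFG scale, following Sariyildiz et al. 2023
    "num_inference_steps": 20,    # total diffusion steps
    "prompt_template": "A picture of a [category]",  # fixed initial prompt
}

assert optimizer.param_groups[0]["lr"] == 2e-4
for _ in range(100):
    optimizer.step()   # dummy step; the schedule drives the lr toward 0
    scheduler.step()
assert optimizer.param_groups[0]["lr"] < 1e-6  # fully annealed after T_max steps
```

The `T_max=100` horizon is an assumption for illustration; the paper only states that training runs until convergence, so the schedule length would need to be chosen per session.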