Contrastive Visual Data Augmentation

Authors: Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments with LLaVA-NeXT on the 3 datasets show CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (Novel Species), 5.1% (SUN), and 6.0% (iNat).
Researcher Affiliation | Academia | UCLA, UIUC, and TAMU.
Pseudocode | No | The paper describes the CoDA method in Sections 3.1, 3.2, and 3.3 using descriptive text and flowcharts, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and data at contrastive-visual-data-augmentation.github.io
Open Datasets | Yes | We show the effectiveness of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect Novel Species, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. Code and data at contrastive-visual-data-augmentation.github.io
Dataset Splits | Yes | The images are split into training, validation, and test sets. For each species, there are 5 training images, 15 validation images, and 15 test images.
Hardware Specification | Yes | The feature selection step is executed on an NVIDIA A100 GPU, processing features in approximately 2 hours. For synthetic image generation, we employ Stable Diffusion 3.5 Large, running on a single A100 GPU. Post-generation, we perform automated verification using LLaVA V1.6-34b, running on an A6000 GPU. Training runs on two NVIDIA A6000 GPUs, leveraging DeepSpeed ZeRO-3 for distributed optimization and mixed precision (bf16) for efficiency. Inference runs on a single A6000 GPU with a batch size of 20, taking approximately 1 hour to complete.
Software Dependencies | No | The paper mentions several models and tools such as GPT-4o-mini, Stable Diffusion 3.5 Large, Recraft V3, LLaVA V1.6-34b, and DeepSpeed ZeRO-3, but does not provide specific version numbers for software libraries or programming languages (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We train LLaVA V1.6-34b with supervised fine-tuning (SFT) using LoRA with rank 128 and alpha 256, optimizing memory efficiency while maintaining model expressiveness. The vision encoder is CLIP-ViT-Large-Patch14-336, with an MLP projector aligning visual and text features. We use a cosine learning rate scheduler with a 3% warmup ratio, training for 30 epochs with a batch size of 5 and a learning rate of 2e-4.
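The 5/15/15 per-species split quoted in the Dataset Splits row can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the function name, the fixed seed, and the sort-then-shuffle convention are all assumptions.

```python
import random

def split_species_images(image_paths, seed=0):
    """Split one species' 35 images into 5 train / 15 val / 15 test,
    matching the counts reported in the paper (helper is illustrative)."""
    assert len(image_paths) == 35, "paper reports 5 + 15 + 15 images per species"
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    paths = sorted(image_paths)   # deterministic order before shuffling
    rng.shuffle(paths)
    return {"train": paths[:5], "val": paths[5:20], "test": paths[20:]}
```

The three partitions are disjoint by construction and exhaust the 35 images, which is the property a reproducibility check on such a split would verify.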
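The cosine learning-rate schedule with a 3% warmup ratio quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal sketch under assumptions not stated in the paper: that warmup is linear and that the cosine decays to zero, which is the common default in Hugging Face-style trainers.

```python
import math

PEAK_LR = 2e-4        # learning rate from the reported setup
WARMUP_RATIO = 0.03   # 3% warmup ratio from the reported setup

def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Cosine learning-rate schedule with linear warmup (decay-to-zero assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp-up to the peak LR
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

For example, with 1,000 total steps the first 30 steps ramp linearly up to 2e-4, after which the rate follows a cosine curve down to zero by the final step.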