Contrastive Visual Data Augmentation
Authors: Yu Zhou, Bingxuan Li, Tang Mohan, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments with LLaVA-NeXT on the 3 datasets show CoDA significantly outperforms SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (Novel Species), 5.1% (SUN), and 6.0% (iNat). |
| Researcher Affiliation | Academia | UCLA, UIUC, TAMU. |
| Pseudocode | No | The paper describes the CoDA method in sections 3.1, 3.2, and 3.3 using descriptive text and flowcharts, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and data at contrastive-visual-data-augmentation.github.io |
| Open Datasets | Yes | We show the effectiveness of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect Novel Species, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. Code and data at contrastive-visual-data-augmentation.github.io |
| Dataset Splits | Yes | The images are split into training, validation, and test sets. For each species, there are 5 training images, 15 validation images, and 15 test images. |
| Hardware Specification | Yes | The feature selection step is executed on an NVIDIA A100 GPU, processing features in approximately 2 hours. For synthetic image generation, we employ Stable Diffusion 3.5 Large, running on a single A100 GPU. Post-generation, we perform automated verification using LLaVA-v1.6-34b, running on an A6000 GPU. The training runs on two NVIDIA A6000 GPUs, leveraging DeepSpeed ZeRO-3 for distributed optimization and mixed precision (bf16) for efficiency. Inference runs on a single A6000 GPU with a batch size of 20, taking approximately 1 hour to complete. |
| Software Dependencies | No | The paper mentions several models and tools like GPT-4o-mini, Stable Diffusion 3.5 Large, Recraft V3, LLaVA-v1.6-34b, and DeepSpeed ZeRO-3, but does not provide specific version numbers for software libraries or programming languages used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train LLaVA-v1.6-34b with supervised fine-tuning (SFT) using LoRA with rank 128 and alpha 256, optimizing memory efficiency while maintaining model expressiveness. The vision encoder is CLIP-ViT-Large-Patch14-336, with an MLP projector aligning visual and text features. We use a cosine learning rate scheduler with a 3% warmup ratio, training for 30 epochs with a batch size of 5 and a learning rate of 2e-4. |
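The learning-rate schedule quoted in the Experiment Setup row (cosine decay with a 3% warmup ratio and a peak rate of 2e-4) can be sketched in a few lines. This is a minimal stand-alone reimplementation for illustration, not the paper's actual training code; the function name and the choice of linear warmup and decay-to-zero are assumptions consistent with common SFT practice (e.g., the default cosine-with-warmup schedule in Hugging Face Transformers).

```python
import math

# Hyperparameters quoted from the paper's reported setup.
BASE_LR = 2e-4       # peak learning rate
WARMUP_RATIO = 0.03  # 3% of total optimizer steps
# LoRA rank 128, alpha 256, batch size 5, 30 epochs (not used below,
# listed here only to mirror the reported configuration).

def cosine_lr_with_warmup(step, total_steps,
                          base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Hypothetical helper: linear warmup over the first `warmup_ratio`
    of steps, then cosine decay from `base_lr` down to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1,000 total steps the warmup phase covers the first 30 steps, the rate peaks at 2e-4 at the end of warmup, and it decays smoothly toward zero by the final step.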