Contrastive Visual Data Augmentation

Authors: Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments with LLaVA-NeXT on the 3 datasets show CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (Novel Species), 5.1% (SUN), and 6.0% (iNat).
Researcher Affiliation | Academia | UCLA, UIUC, and TAMU.
Pseudocode | No | The paper describes the CoDA method in Sections 3.1, 3.2, and 3.3 using descriptive text and flowcharts, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and data at contrastive-visual-data-augmentation.github.io
Open Datasets | Yes | We show the effectiveness of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect Novel Species, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. Code and data at contrastive-visual-data-augmentation.github.io
Dataset Splits | Yes | The images are split into training, validation, and test sets. For each species, there are 5 training images, 15 validation images, and 15 test images.
Hardware Specification | Yes | The feature selection step is executed on an NVIDIA A100 GPU, processing features in approximately 2 hours. For synthetic image generation, we employ Stable Diffusion 3.5 Large, running on a single A100 GPU. Post-generation, we perform automated verification using LLaVA V1.6-34b, running on an A6000 GPU. Training runs on two NVIDIA A6000 GPUs, leveraging DeepSpeed ZeRO-3 for distributed optimization and mixed precision (bf16) for efficiency. Inference runs on a single A6000 GPU with a batch size of 20, taking approximately 1 hour to complete.
Software Dependencies | No | The paper mentions several models and tools such as GPT-4o-mini, Stable Diffusion 3.5 Large, Recraft V3, LLaVA V1.6-34b, and DeepSpeed ZeRO-3, but does not provide specific version numbers for software libraries or programming languages (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We train LLaVA V1.6-34b with supervised fine-tuning (SFT) using LoRA with rank 128 and alpha 256, optimizing memory efficiency while maintaining model expressiveness. The vision encoder is CLIP-ViT-Large-Patch14-336, with an MLP projector aligning visual and text features. We use a cosine learning rate scheduler with a 3% warmup ratio, training for 30 epochs with a batch size of 5 and a learning rate of 2e-4.
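The 5/15/15 per-species split quoted in the Dataset Splits row can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the function name, the fixed seed, and the sort-then-shuffle convention are all assumptions.

```python
import random

def split_species_images(image_paths, seed=0):
    """Split one species' 35 images into 5 train / 15 val / 15 test,
    matching the counts reported in the paper (helper is illustrative)."""
    assert len(image_paths) == 35, "paper reports 5 + 15 + 15 images per species"
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    paths = sorted(image_paths)   # deterministic order before shuffling
    rng.shuffle(paths)
    return {"train": paths[:5], "val": paths[5:20], "test": paths[20:]}
```

The three partitions are disjoint by construction and exhaust the 35 images, which is the property a reproducibility check on such a split would verify.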
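The cosine learning-rate schedule with a 3% warmup ratio quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal sketch under assumptions not stated in the paper: that warmup is linear and that the cosine decays to zero, which is the common default in Hugging Face-style trainers.

```python
import math

PEAK_LR = 2e-4        # learning rate from the reported setup
WARMUP_RATIO = 0.03   # 3% warmup ratio from the reported setup

def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Cosine learning-rate schedule with linear warmup (decay-to-zero assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp-up to the peak LR
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

For example, with 1,000 total steps the first 30 steps ramp linearly up to 2e-4, after which the rate follows a cosine curve down to zero by the final step.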