Audio-Visual Dataset Distillation

Authors: Saksham Singh Kushwaha, Siva Sai Nagender Vasireddy, Kai Wang, Yapeng Tian

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive audio-visual classification and retrieval experiments on four audio-visual datasets (AVE, MUSIC-21, VGGSound, and VGGS-10k) demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data.
Researcher Affiliation | Academia | Saksham Singh Kushwaha (EMAIL), Department of Computer Science, The University of Texas at Dallas; Siva Sai Nagender Vasireddy (EMAIL), Department of Computer Science, The University of Texas at Dallas; Kai Wang (EMAIL), Institutes of Data Science & School of Computing, National University of Singapore; Yapeng Tian (EMAIL), Department of Computer Science, The University of Texas at Dallas.
Pseudocode | Yes | Algorithm 1: Audio-Visual Dataset Distillation.
Open Source Code | Yes | Source code and pre-trained model: https://github.com/sakshamsingh1/AVDD.
Open Datasets | Yes | Our extensive experiments on four widely used audio-visual datasets: AVE (Tian et al., 2018), MUSIC-21 (Zhao et al., 2019), VGGSound (Chen et al., 2020c), and VGGS-10k (Chen et al., 2020c) support the following findings: effective joint audio-visual integration outperforms unimodal performance in audio-visual dataset distillation; implicit cross-matching and cross-modal gap matching improve vanilla audio-visual distribution matching by distilling the audio-visual alignment into synthetic data; and herding initialization & the factor technique further improve audio-visual distillation.
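The row above names the paper's three matching objectives: vanilla distribution matching (DM), implicit cross-matching (ICM), and cross-modal gap matching (CGM). The following is a minimal pure-Python sketch of how such a combined per-class loss could look; the exact ICM and CGM formulations here are assumptions for illustration, not the paper's definitions, and `av_distillation_losses`, `mean_embed`, and `mse` are hypothetical names.

```python
def mean_embed(feats):
    """Per-dimension mean of a list of feature vectors."""
    d = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(d)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def av_distillation_losses(real_a, real_v, syn_a, syn_v,
                           lam_icm=10.0, lam_cgm=10.0):
    """One per-class loss evaluation (sketch, hypothetical formulation).

    real_a / real_v: audio / visual embeddings of real samples of a class.
    syn_a / syn_v:   embeddings of the synthetic AV pairs for that class.
    lam_icm / lam_cgm: the lambda_ICM / lambda_CGM weights from the setup.
    """
    mu_ra, mu_rv = mean_embed(real_a), mean_embed(real_v)
    mu_sa, mu_sv = mean_embed(syn_a), mean_embed(syn_v)
    # Vanilla distribution matching: match per-modality class means.
    l_dm = mse(mu_ra, mu_sa) + mse(mu_rv, mu_sv)
    # Implicit cross-matching (assumed form): match synthetic audio to the
    # real visual mean and vice versa, encouraging cross-modal alignment.
    l_icm = mse(mu_sa, mu_rv) + mse(mu_sv, mu_ra)
    # Cross-modal gap matching (assumed form): the audio-visual gap of the
    # synthetic data should mirror the gap of the real data.
    gap_real = [a - v for a, v in zip(mu_ra, mu_rv)]
    gap_syn = [a - v for a, v in zip(mu_sa, mu_sv)]
    l_cgm = mse(gap_real, gap_syn)
    return l_dm + lam_icm * l_icm + lam_cgm * l_cgm
```

In the paper's setup the weights would be 10 for IPC 1 and 10 and 20 for IPC 20; in practice the loss would be computed on encoder features and backpropagated into the synthetic pixels and spectrograms.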
Dataset Splits | Yes | For exploratory analyses and experimental setup of this novel task, we randomly selected a subset of 10 classes from VGGSound with 8,808 train videos and 444 test videos. This subset is referred to as VGGS-10k. MUSIC-21 (Zhao et al., 2019) comprises synchronized audio-visual recordings featuring 21 distinct musical instruments. For our study, we focus exclusively on the solo performances subset and segment each video clip into discrete, non-overlapping windows of one second. We randomly partition this subset into train/val/test splits of 146,908/7,103/42,440 samples, respectively. AVE (Tian et al., 2018) consists of 4,143 video clips spanning 28 event categories. We segment each clip into non-overlapping one-second windows aligned with the synchronized annotations, resulting in train/val/test splits of 27,726/3,288/3,305 samples, respectively.
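Both MUSIC-21 and AVE are preprocessed by cutting clips into non-overlapping one-second windows. A minimal sketch of that segmentation step, assuming trailing partial seconds are dropped (`segment_clip` is a hypothetical helper, not from the paper's code):

```python
def segment_clip(duration_s, win_s=1.0):
    """Return (start, end) times of non-overlapping windows for one clip.

    duration_s: clip length in seconds.
    win_s:      window length in seconds (1.0 s in the paper's setup).
    Any trailing partial window is discarded (an assumption).
    """
    n = int(duration_s // win_s)
    return [(i * win_s, (i + 1) * win_s) for i in range(n)]
```

For example, a 10-second AVE clip would yield ten windows, each paired with its synchronized event annotation.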
Hardware Specification | Yes | Compute resources: we run our experiments on A5000 and A6000 GPUs with 24 GB and 48 GB of memory, respectively.
Software Dependencies | No | The paper mentions using a ConvNet architecture and different fusion strategies but does not provide specific version numbers for software dependencies such as Python, PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | We use a learning rate of 0.2 and an SGD optimizer with a momentum of 0.5. Our synthetic data is initialized with Herding-selected audio-visual (AV) pairs and trained with a batch size of 128. For IPC 1 and 10, we set λICM and λCGM to 10, and for IPC 20, we set them to 20. Audio is sampled at 11 kHz and transformed into 128 × 56 log mel-spectrograms. During evaluation, the initial learning rate is kept at 1e-3 for the audio model, 1e-4 for the visual part, and 1e-4 for the classifier layers. The learning rates are lowered by a factor of 0.1 after every 10 epochs.
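The evaluation schedule above (multiply each learning rate by 0.1 after every 10 epochs) can be sketched as a small step-decay helper; `eval_lr` is a hypothetical name, and whether the paper applies the decay exactly at epoch multiples of 10 is an assumption.

```python
def eval_lr(epoch, base_lr, decay=0.1, step=10):
    """Learning rate at a given epoch under a step-decay schedule:
    multiply base_lr by `decay` once per completed `step` epochs.

    E.g. the audio model starts at base_lr=1e-3, the visual part and
    classifier layers at base_lr=1e-4 (per the setup above).
    """
    return base_lr * decay ** (epoch // step)
```

In a PyTorch training loop this would correspond to a standard step-wise scheduler (e.g. `StepLR(optimizer, step_size=10, gamma=0.1)`).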