Audio-Visual Dataset Distillation
Authors: Saksham Singh Kushwaha, Siva Sai Nagender Vasireddy, Kai Wang, Yapeng Tian
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive audio-visual classification and retrieval experiments on four audio-visual datasets, AVE, MUSIC-21, VGGSound, and VGGSound-10K, demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data. |
| Researcher Affiliation | Academia | Saksham Singh Kushwaha, Department of Computer Science, The University of Texas at Dallas; Siva Sai Nagender Vasireddy, Department of Computer Science, The University of Texas at Dallas; Kai Wang, Institute of Data Science & School of Computing, National University of Singapore; Yapeng Tian, Department of Computer Science, The University of Texas at Dallas |
| Pseudocode | Yes | Algorithm 1 Audio-Visual Dataset Distillation |
| Open Source Code | Yes | Source code and pre-trained model: https://github.com/sakshamsingh1/AVDD. |
| Open Datasets | Yes | Our extensive experiments on four widely used audio-visual datasets: AVE (Tian et al., 2018), MUSIC-21 (Zhao et al., 2019), VGGSound (Chen et al., 2020c), and VGGS-10k (Chen et al., 2020c) support the following findings: effective joint audio-visual integration outperforms unimodal performance in audio-visual dataset distillation; implicit cross-matching and cross-modal gap matching improve vanilla audio-visual distribution matching by distilling the audio-visual alignment into the synthetic data; and herding initialization and the factor technique further improve audio-visual distillation. |
| Dataset Splits | Yes | For exploratory analyses and experimental setup of this novel task, we randomly selected a subset of 10 classes from VGGSound with 8,808 train videos and 444 test videos. This subset is referred to as VGGS-10k. MUSIC-21 (Zhao et al., 2019) comprises synchronized audio-visual recordings featuring 21 distinct musical instruments. For our study, we focus exclusively on the solo-performances subset and segment each video clip into discrete, non-overlapping windows of one second. We randomly partition this subset into train/val/test splits of 146,908/7,103/42,440 samples, respectively. AVE (Tian et al., 2018) consists of 4,143 video clips spanning 28 event categories. We segment each clip into non-overlapping one-second windows aligned with the synchronized annotations, resulting in train/val/test splits of 27,726/3,288/3,305 samples, respectively. |
| Hardware Specification | Yes | Compute resources. We run our experiments on A5000 GPUs with 24 GB memory and A6000 GPUs with 48 GB memory. |
| Software Dependencies | No | The paper mentions using a ConvNet architecture and different fusion strategies but does not provide version numbers for software dependencies such as Python, PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | We use a learning rate of 0.2 and an SGD optimizer with a momentum of 0.5. Our synthetic data is initialized with Herding-selected audio-visual (AV) pairs and trained with a batch size of 128. For IPC 1 and 10, we set λ_ICM and λ_CGM to 10, and for IPC 20, we set them to 20. Audio is sampled at 11 kHz and transformed into 128 × 56 log-mel spectrograms. During evaluation, the initial learning rate is kept at 1e-3 for the audio model, 1e-4 for the visual model, and 1e-4 for the classifier layers. The learning rates are lowered by a factor of 0.1 every 10 epochs. |
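The evaluation schedule quoted in the Experiment Setup row (per-module initial learning rates, decayed by 0.1 every 10 epochs) can be sketched as a minimal helper. This is our own illustrative code, not from the paper's repository; the function name `eval_lr` and the `EVAL_LRS` dictionary are assumptions, while the numeric values come from the row above.

```python
def eval_lr(initial_lr, epoch, decay=0.1, step=10):
    """Step-decay schedule: multiply the rate by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)

# Per-module initial learning rates during evaluation (values from the paper)
EVAL_LRS = {"audio": 1e-3, "visual": 1e-4, "classifier": 1e-4}

# At epoch 25 the audio rate has been decayed twice: 1e-3 * 0.1**2 ≈ 1e-5
audio_lr = eval_lr(EVAL_LRS["audio"], 25)
```

In PyTorch this would typically be expressed with one parameter group per module and `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)`.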