Audio-Visual Dataset Distillation

Authors: Saksham Singh Kushwaha, Siva Sai Nagender Vasireddy, Kai Wang, Yapeng Tian

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive audio-visual classification and retrieval experiments on four audio-visual datasets (AVE, MUSIC-21, VGGSound, and VGGS-10k) demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data.
Researcher Affiliation | Academia | Saksham Singh Kushwaha (EMAIL), Department of Computer Science, The University of Texas at Dallas; Siva Sai Nagender Vasireddy (EMAIL), Department of Computer Science, The University of Texas at Dallas; Kai Wang (EMAIL), Institutes of Data Science & School of Computing, National University of Singapore; Yapeng Tian (EMAIL), Department of Computer Science, The University of Texas at Dallas.
Pseudocode | Yes | Algorithm 1: Audio-Visual Dataset Distillation.
Open Source Code | Yes | Source code and pre-trained model: https://github.com/sakshamsingh1/AVDD.
Open Datasets | Yes | Our extensive experiments on four widely used audio-visual datasets: AVE (Tian et al., 2018), MUSIC-21 (Zhao et al., 2019), VGGSound (Chen et al., 2020c), and VGGS-10k (Chen et al., 2020c) support the following findings: effective joint audio-visual integration outperforms unimodal performance in audio-visual dataset distillation; implicit cross-matching and cross-modal gap matching improve vanilla audio-visual distribution matching by distilling the audio-visual alignment into synthetic data; and herding initialization & the factor technique further improve audio-visual distillation.
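The row above names the paper's three matching objectives: vanilla distribution matching (DM), implicit cross-matching (ICM), and cross-modal gap matching (CGM). The following is a minimal pure-Python sketch of how such a combined per-class loss could look; the exact ICM and CGM formulations here are assumptions for illustration, not the paper's definitions, and `av_distillation_losses`, `mean_embed`, and `mse` are hypothetical names.

```python
def mean_embed(feats):
    """Per-dimension mean of a list of feature vectors."""
    d = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(d)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def av_distillation_losses(real_a, real_v, syn_a, syn_v,
                           lam_icm=10.0, lam_cgm=10.0):
    """One per-class loss evaluation (sketch, hypothetical formulation).

    real_a / real_v: audio / visual embeddings of real samples of a class.
    syn_a / syn_v:   embeddings of the synthetic AV pairs for that class.
    lam_icm / lam_cgm: the lambda_ICM / lambda_CGM weights from the setup.
    """
    mu_ra, mu_rv = mean_embed(real_a), mean_embed(real_v)
    mu_sa, mu_sv = mean_embed(syn_a), mean_embed(syn_v)
    # Vanilla distribution matching: match per-modality class means.
    l_dm = mse(mu_ra, mu_sa) + mse(mu_rv, mu_sv)
    # Implicit cross-matching (assumed form): match synthetic audio to the
    # real visual mean and vice versa, encouraging cross-modal alignment.
    l_icm = mse(mu_sa, mu_rv) + mse(mu_sv, mu_ra)
    # Cross-modal gap matching (assumed form): the audio-visual gap of the
    # synthetic data should mirror the gap of the real data.
    gap_real = [a - v for a, v in zip(mu_ra, mu_rv)]
    gap_syn = [a - v for a, v in zip(mu_sa, mu_sv)]
    l_cgm = mse(gap_real, gap_syn)
    return l_dm + lam_icm * l_icm + lam_cgm * l_cgm
```

In the paper's setup the weights would be 10 for IPC 1 and 10 and 20 for IPC 20; in practice the loss would be computed on encoder features and backpropagated into the synthetic pixels and spectrograms.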
Dataset Splits | Yes | For exploratory analyses and experimental setup of this novel task, we randomly selected a subset of 10 classes from VGGSound with 8,808 train videos and 444 test videos. This subset is referred to as VGGS-10k. MUSIC-21 (Zhao et al., 2019) comprises synchronized audio-visual recordings featuring 21 distinct musical instruments. For our study, we focus exclusively on the solo performances subset and segment each video clip into discrete, non-overlapping windows of one second. We randomly partition this subset into train/val/test splits of 146,908/7,103/42,440 samples, respectively. AVE (Tian et al., 2018) consists of 4,143 video clips spanning 28 event categories. We segment each clip into non-overlapping one-second windows aligned with the synchronized annotations, resulting in train/val/test splits of 27,726/3,288/3,305 samples, respectively.
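Both MUSIC-21 and AVE are preprocessed by cutting clips into non-overlapping one-second windows. A minimal sketch of that segmentation step, assuming trailing partial seconds are dropped (`segment_clip` is a hypothetical helper, not from the paper's code):

```python
def segment_clip(duration_s, win_s=1.0):
    """Return (start, end) times of non-overlapping windows for one clip.

    duration_s: clip length in seconds.
    win_s:      window length in seconds (1.0 s in the paper's setup).
    Any trailing partial window is discarded (an assumption).
    """
    n = int(duration_s // win_s)
    return [(i * win_s, (i + 1) * win_s) for i in range(n)]
```

For example, a 10-second AVE clip would yield ten windows, each paired with its synchronized event annotation.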
Hardware Specification | Yes | Compute resources: we run our experiments on A5000 and A6000 GPUs with 24 GB and 48 GB of memory, respectively.
Software Dependencies | No | The paper mentions using a ConvNet architecture and different fusion strategies but does not provide specific version numbers for software dependencies such as Python, PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | We use a learning rate of 0.2 and an SGD optimizer with a momentum of 0.5. Our synthetic data is initialized with Herding-selected audio-visual (AV) pairs and trained with a batch size of 128. For IPC 1 and 10, we set λICM and λCGM to 10, and for IPC 20, we set them to 20. Audio is sampled at 11 kHz and transformed into 128 × 56 log mel-spectrograms. During evaluation, the initial learning rate is kept at 1e-3 for the audio model, 1e-4 for the visual part, and 1e-4 for the classifier layers. The learning rates are lowered by a factor of 0.1 after every 10 epochs.
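The evaluation schedule above (multiply each learning rate by 0.1 after every 10 epochs) can be sketched as a small step-decay helper; `eval_lr` is a hypothetical name, and whether the paper applies the decay exactly at epoch multiples of 10 is an assumption.

```python
def eval_lr(epoch, base_lr, decay=0.1, step=10):
    """Learning rate at a given epoch under a step-decay schedule:
    multiply base_lr by `decay` once per completed `step` epochs.

    E.g. the audio model starts at base_lr=1e-3, the visual part and
    classifier layers at base_lr=1e-4 (per the setup above).
    """
    return base_lr * decay ** (epoch // step)
```

In a PyTorch training loop this would correspond to a standard step-wise scheduler (e.g. `StepLR(optimizer, step_size=10, gamma=0.1)`).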