Sounding that Object: Interactive Object-Aware Image to Audio Generation
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project site: https://tinglok.netlify.app/files/avobject/. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2ByteDance Inc. Correspondence to: Tingle Li <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and processes in detail, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code. |
| Open Source Code | No | We will release code and models upon acceptance. |
| Open Datasets | Yes | We use AudioSet (Gemmeke et al., 2017) as our primary data source... We then evaluate models on AudioCaps (also a subset of AudioSet) (Kim et al., 2019)... VGG-Sound dataset (Chen et al., 2020)... ImageHear dataset (Sheffer & Adi, 2023) |
| Dataset Splits | Yes | We uniformly sample 48 hours across these categories for the test set, with the remaining used for training. Notably, there is no overlap between training and testing videos. |
| Hardware Specification | No | The paper does not specify any particular GPU models, CPU types, or other hardware used for running the experiments. It only mentions general training configurations. |
| Software Dependencies | No | The paper mentions several models and toolboxes used (e.g., 'pre-trained latent diffusion model (Liu et al., 2023)', 'CLAP audio encoder (Elizalde et al., 2023)', 'CLIP image encoder (Radford et al., 2021)', 'HiFi-GAN neural vocoder (Kong et al., 2020a)', 'PANNs model', 'AudioLDM-Eval toolbox', 'OpenL3'), but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We apply a 512-point discrete Fourier transform with a frame length of 64 ms and a frame shift of 10 ms... The model is then trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a batch size of 64, a learning rate of 10⁻⁴, β1 = 0.95, β2 = 0.999, ϵ = 10⁻⁶, and a weight decay of 10⁻³ over 300 epochs... We implement a linear noise schedule consisting of N = 1000 diffusion steps, from β1 = 0.0015 to βN = 0.0195... The DDIM sampling method (Song et al., 2020) is used with 200 steps to facilitate efficient generation. At test time, we apply CFG with a guidance scale λ set to 2.0. |
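The diffusion hyperparameters quoted in the Experiment Setup row can be sketched directly. The following is a minimal illustration, not the authors' released code (no code is public): it builds the linear noise schedule from β1 = 0.0015 to βN = 0.0195 over N = 1000 steps and collects the reported AdamW settings into a config dict. Variable names here are our own.

```python
import numpy as np

# Hyperparameters quoted from the paper's experiment setup.
N_STEPS = 1000                     # diffusion steps N
BETA_1, BETA_N = 0.0015, 0.0195    # linear noise-schedule endpoints

def linear_beta_schedule(n_steps=N_STEPS, beta_start=BETA_1, beta_end=BETA_N):
    """Per-step noise variances beta_t on a linear grid, as described in the paper."""
    return np.linspace(beta_start, beta_end, n_steps)

betas = linear_beta_schedule()
# Cumulative product \bar{alpha}_t = prod(1 - beta_s), used to noise x_0 at step t.
alphas_cumprod = np.cumprod(1.0 - betas)

# Reported AdamW optimizer settings (Loshchilov & Hutter, 2017);
# the dict keys mirror torch.optim.AdamW's constructor arguments.
ADAMW_CFG = dict(lr=1e-4, betas=(0.95, 0.999), eps=1e-6, weight_decay=1e-3)
```

This reproduces only the schedule and optimizer configuration; the model architecture, DDIM sampler (200 steps), and CFG scale λ = 2.0 would sit on top of it.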