Sounding that Object: Interactive Object-Aware Image to Audio Generation

Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project site: https://tinglok.netlify.app/files/avobject/.
Researcher Affiliation Collaboration ¹University of California, Berkeley; ²ByteDance Inc. Correspondence to: Tingle Li <EMAIL>.
Pseudocode No The paper describes the model architecture and processes in detail, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code No We will release code and models upon acceptance.
Open Datasets Yes We use AudioSet (Gemmeke et al., 2017) as our primary data source... We then evaluate models on AudioCaps (also a subset of AudioSet) (Kim et al., 2019)... VGG-Sound dataset (Chen et al., 2020)... ImageHear dataset (Sheffer & Adi, 2023)
Dataset Splits Yes We uniformly sample 48 hours across these categories for the test set, with the remaining used for training. Notably, there is no overlap between training and testing videos.
Hardware Specification No The paper does not specify any particular GPU models, CPU types, or other hardware used for running the experiments. It only mentions general training configurations.
Software Dependencies No The paper mentions several models and toolboxes used (e.g., 'pre-trained latent diffusion model (Liu et al., 2023)', 'CLAP audio encoder (Elizalde et al., 2023)', 'CLIP image encoder (Radford et al., 2021)', 'HiFi-GAN neural vocoder (Kong et al., 2020a)', 'PANNs model', 'AudioLDM-Eval toolbox', 'OpenL3'), but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup Yes We apply a 512-point discrete Fourier transform with a frame length of 64 ms and a frame shift of 10 ms... The model is then trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a batch size of 64, a learning rate of 10⁻⁴, β₁ = 0.95, β₂ = 0.999, ε = 10⁻⁶, and a weight decay of 10⁻³ over 300 epochs... We implement a linear noise schedule consisting of N = 1000 diffusion steps, from β₁ = 0.0015 to βN = 0.0195... The DDIM sampling method (Song et al., 2020) is used with 200 steps to facilitate efficient generation. At test time, we apply CFG with a guidance scale λ set to 2.0.
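The quoted setup can be sketched numerically. Below is a minimal NumPy illustration, not the authors' unreleased implementation: it builds the stated linear β schedule (N = 1000 steps, from 0.0015 to 0.0195), derives the cumulative ᾱ used in the forward diffusion process, and combines conditional and unconditional noise predictions with the standard classifier-free guidance formula at the stated scale λ = 2.0 (the paper does not spell out the exact CFG variant, so the standard formulation is assumed; all names here are illustrative).

```python
import numpy as np

# Linear noise schedule as stated in the paper: N = 1000 steps,
# beta_1 = 0.0015 rising linearly to beta_N = 0.0195.
N = 1000
betas = np.linspace(0.0015, 0.0195, N)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, used in q(x_t | x_0)


def q_sample(x0, t, rng):
    """Forward diffusion: draw a noised latent x_t from a clean latent x0 at step t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def cfg_combine(eps_cond, eps_uncond, scale=2.0):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale (lambda = 2.0)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Example: noise a dummy latent at a mid-schedule step.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))
xt = q_sample(x0, t=500, rng=rng)
```

At λ = 1.0 the guidance formula reduces to the plain conditional prediction; λ = 2.0 pushes samples further toward the conditioning signal, trading diversity for image-audio alignment.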