Sounding that Object: Interactive Object-Aware Image to Audio Generation

Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project site: https://tinglok.netlify.app/files/avobject/.
Researcher Affiliation Collaboration ¹University of California, Berkeley; ²ByteDance Inc. Correspondence to: Tingle Li <EMAIL>.
Pseudocode No The paper describes the model architecture and processes in detail, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code No We will release code and models upon acceptance.
Open Datasets Yes We use AudioSet (Gemmeke et al., 2017) as our primary data source... We then evaluate models on AudioCaps (also a subset of AudioSet) (Kim et al., 2019)... VGG-Sound dataset (Chen et al., 2020)... ImageHear dataset (Sheffer & Adi, 2023)
Dataset Splits Yes We uniformly sample 48 hours across these categories for the test set, with the remaining used for training. Notably, there is no overlap between training and testing videos.
Hardware Specification No The paper does not specify any particular GPU models, CPU types, or other hardware used for running the experiments. It only mentions general training configurations.
Software Dependencies No The paper mentions several models and toolboxes used (e.g., 'pre-trained latent diffusion model (Liu et al., 2023)', 'CLAP audio encoder (Elizalde et al., 2023)', 'CLIP image encoder (Radford et al., 2021)', 'HiFi-GAN neural vocoder (Kong et al., 2020a)', 'PANNs model', 'AudioLDM-Eval toolbox', 'OpenL3'), but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup Yes We apply a 512-point discrete Fourier transform with a frame length of 64 ms and a frame shift of 10 ms... The model is then trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a batch size of 64, a learning rate of 10⁻⁴, β₁ = 0.95, β₂ = 0.999, ε = 10⁻⁶, and a weight decay of 10⁻³ over 300 epochs... We implement a linear noise schedule consisting of N = 1000 diffusion steps, from β₁ = 0.0015 to βN = 0.0195... The DDIM sampling method (Song et al., 2020) is used with 200 steps to facilitate efficient generation. At test time, we apply CFG with a guidance scale λ set to 2.0.
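The quoted setup can be sketched numerically. Below is a minimal NumPy illustration, not the authors' unreleased implementation: it builds the stated linear β schedule (N = 1000 steps, from 0.0015 to 0.0195), derives the cumulative ᾱ used in the forward diffusion process, and combines conditional and unconditional noise predictions with the standard classifier-free guidance formula at the stated scale λ = 2.0 (the paper does not spell out the exact CFG variant, so the standard formulation is assumed; all names here are illustrative).

```python
import numpy as np

# Linear noise schedule as stated in the paper: N = 1000 steps,
# beta_1 = 0.0015 rising linearly to beta_N = 0.0195.
N = 1000
betas = np.linspace(0.0015, 0.0195, N)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, used in q(x_t | x_0)


def q_sample(x0, t, rng):
    """Forward diffusion: draw a noised latent x_t from a clean latent x0 at step t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def cfg_combine(eps_cond, eps_uncond, scale=2.0):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale (lambda = 2.0)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Example: noise a dummy latent at a mid-schedule step.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))
xt = q_sample(x0, t=500, rng=rng)
```

At λ = 1.0 the guidance formula reduces to the plain conditional prediction; λ = 2.0 pushes samples further toward the conditioning signal, trading diversity for image-audio alignment.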