SoundBrush: Sound as a Brush for Visual Scene Editing

Authors: Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the efficacy of our proposed SoundBrush by comparing it with existing sound-guided visual scene editing models (Yariv et al. 2023; Qin et al. 2023; Li, Singh, and Grover 2023). Unlike previous methods, SoundBrush can accurately insert sounding objects and edit the overall scenery to reflect the sound semantics, as shown in Fig. 1. Furthermore, by integrating with a novel view synthesis method (Mildenhall et al. 2020), our framework can be extended to edit 3D scenes, enabling sound-guided 3D scene editing. Our main contributions are summarized as follows: (1) proposing SoundBrush, a model that effectively incorporates auditory information to manipulate visual scenes; (2) generating a comprehensive dataset pairing sound cues with visual data, which facilitates the training of models for sound-guided visual scene editing; and (3) demonstrating SoundBrush's ability to accurately insert objects or manipulate the overall visual scene based on sound cues, including extensions to 3D scene editing. Experiments: We validate the editing power of our proposed SoundBrush both qualitatively and quantitatively. We begin by outlining the experimental setup, which includes the dataset, metrics, and competing methods. We then present comparisons of sound-guided 2D visual scene editing between SoundBrush and existing methods. Finally, we demonstrate how SoundBrush can be extended to edit 3D visual scenes.
Researcher Affiliation | Academia | 1 Department of Electrical Engineering, POSTECH, Korea; 2 Department of Statistics, Inha University, Korea; 3 Graduate School of Artificial Intelligence, POSTECH, Korea; 4 Institute for Convergence Research and Education in Advanced Technology, Yonsei University, Korea
Pseudocode | No | The paper describes the proposed approach and methodology using textual explanations and diagrams (Figure 2), but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Initially, we extract sound categories from the VGGSound dataset (Chen et al. 2020), focusing on environmental sounds like "Waterfall burbling" and "Thunder". ... We begin by extracting paired audio and images from the VGGSound dataset. ... We compare the editing capabilities of our method against three different methods: AudioToken (Yariv et al. 2023), GlueGen (Qin et al. 2023), and InstructAny2Pix (Li, Singh, and Grover 2023). AudioToken and GlueGen were originally designed for generating images from sound and have demonstrated strong image generation performance. To adapt these models for image editing, we employ the training-free Plug-and-Play (PnP) method (Tumanyan et al. 2023), which enables them to edit visual scenes with sound input. Additionally, as GlueGen is initially trained on UrbanSound8K (Salamon, Jacoby, and Bello 2014), we fine-tune this model on the VGGSound dataset to ensure a fair comparison.
Dataset Splits | No | In total, we construct a dataset consisting of 83,614 pairs, with 27,056 fully synthetic and 56,558 involving real data. The example pairs are shown in Fig. 2 (a), row 3. ... Dataset: We construct the evaluation dataset following the previously described dataset construction pipeline. All the audio files are sourced from VGGSound (Chen et al. 2020), an audio-visual dataset containing around 200K videos from 309 sound categories. We select 20 of these categories and use the provided test splits to construct the evaluation dataset.
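The split counts quoted above are internally consistent, which is easy to verify with a quick standalone check (not code from the paper):

```python
# Counts reported in the paper's dataset description.
fully_synthetic = 27_056
with_real_data = 56_558

total_pairs = fully_synthetic + with_real_data
print(total_pairs)  # 83614, matching the reported total of 83,614 pairs
```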
Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or cloud computing instances used for running its experiments.
Software Dependencies | No | The paper mentions various models and frameworks, such as the Latent Diffusion Model (LDM), GPT-4, Prompt-to-Prompt, ImageBind, LaMa, InstructPix2Pix, the CLAP audio encoder, LoRA, and Inception V3. However, it does not provide specific version numbers for these or for underlying software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Ablation study: We conduct a series of experiments to verify our design choices, as detailed in Table 2. We evaluate the impact of varying the number of audio tokens in the mapping network and the effect of applying the loss function specified in Eq. (2). We find that a single audio token (A) does not carry sufficient information for sound-guided image editing. Increasing the number of tokens to five (B) yields significant improvements; however, further increasing it to ten (C) begins to degrade performance. Additionally, we validate Eq. (2) by comparing results under configurations (B) and (C) and observe that applying Eq. (2) stabilizes model training and leads to improved performance. These insights have guided us to our final model configuration (C).
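The ablation's main knob is how many learnable audio tokens the mapping network emits (one, five, or ten). As a rough illustration only, a mapping network of this shape can be sketched in plain Python; the class name, dimensions, and random initialisation below are invented for the example and are not taken from the paper:

```python
import random

class AudioTokenMapper:
    """Toy sketch: project one pooled audio embedding into N token
    embeddings that could be spliced into a text-encoder prompt.
    Dimensions and initialisation are illustrative, not the paper's."""

    def __init__(self, audio_dim, token_dim, num_tokens, seed=0):
        rng = random.Random(seed)
        self.num_tokens = num_tokens
        # One linear projection (audio_dim x token_dim) per output token.
        self.weights = [
            [[rng.gauss(0.0, 0.02) for _ in range(token_dim)]
             for _ in range(audio_dim)]
            for _ in range(num_tokens)
        ]

    def __call__(self, audio_embedding):
        tokens = []
        for w in self.weights:
            # Matrix-vector product: token[j] = sum_i a[i] * w[i][j].
            token = [
                sum(a * w[i][j] for i, a in enumerate(audio_embedding))
                for j in range(len(w[0]))
            ]
            tokens.append(token)
        return tokens  # num_tokens vectors, each of length token_dim

# The ablation compares num_tokens = 1 (A), 5 (B), and 10 (C).
mapper = AudioTokenMapper(audio_dim=8, token_dim=4, num_tokens=5)
tokens = mapper([0.1] * 8)
```

Varying `num_tokens` is the knob the ablation studies: too few tokens under-specify the sound semantics, while too many add capacity that the paper reports can start to degrade editing performance.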