MindPainter: Efficient Brain-Conditioned Painting of Natural Images via Cross-Modal Self-Supervised Learning

Authors: Muzhou Yu, Shuyun Lin, Hongwei Yan, Kaisheng Ma

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Experiment Dataset and Implementation: The NSD dataset (Allen 2022) is a large-scale fMRI dataset collecting the brain responses of human visual perception when viewing natural scenes from MS-COCO (Lin et al. 2014). We utilize it for Pseudo Brain Signal Optimization. Here we develop subject-specific models for each of the four subjects in NSD. We present the results of subject 1 in this paper. In fine-tuning, we select 10,000 images from the Open Images dataset in proportion to the original distribution of 600 categories. For inference illustration, the source images are collected from Open Images and free-to-use images from the Bing website, and fMRI from the NSD test set is applied as the condition. In the user study, we randomly pair 100 source images from Open Images with 100 fMRI from NSD as our test benchmark.

Qualitative Illustration: We apply MindPainter to various painting scenarios, including inpainting and outpainting tasks with arbitrary masks created by users. As shown in Figure 4, MindPainter enables the seamless integration of implicit brain-signal semantics into natural image edits.

Comparisons (Qualitative Analysis): We adopt two state-of-the-art methods, MindEye and Paint by Example, as the brain decoder and the image-based editing strategy, and combine them as the baseline in our paper. In Figure 6, we qualitatively compare MindPainter with this baseline.

User Study: To present a quantitative analysis of the painting results, we conduct a human perceptual evaluation study with 20 participants, divided into 5 groups. Each group evaluates 20 pairs of comparisons, covering our results and the baseline results. Participants are asked to score three aspects independently: the generated image quality, the alignment to the semantics of the brain signal, and the consistency. In total, we collected 1,200 answers, summarized in Table 1.
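The user-study tally quoted above is internally consistent; a quick sanity check, assuming every participant scores each of their group's 20 pairs on all three criteria (variable names are illustrative):

```python
participants = 20           # divided into 5 groups of 4
pairs_per_participant = 20  # each group evaluates 20 comparison pairs
criteria = 3                # image quality, semantic alignment, consistency

# Every participant answers once per pair per criterion.
total_answers = participants * pairs_per_participant * criteria
print(total_answers)  # 1200, matching the count reported in the paper
```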
Ablation Study: As shown in Figure 7 and Table 1, we conduct ablations with a user study on three key techniques introduced in our method to compare their effectiveness: the diffusion prior, the PBG, and the Multi-Mask Generation Policy.
Researcher Affiliation: Academia. Muzhou Yu¹*, Shuyun Lin²*, Hongwei Yan², Kaisheng Ma²; ¹Xi'an Jiaotong University, ²Tsinghua University. EMAIL, EMAIL, EMAIL
Pseudocode: No. The paper describes the methods in regular paragraph text and uses Figure 3 to illustrate the training and inference overview of MindPainter, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code: No. The code and the link to the extended version will be available on GitHub.
Open Datasets: Yes. The NSD dataset (Allen 2022) is a large-scale fMRI dataset collecting the brain responses of human visual perception when viewing natural scenes from MS-COCO (Lin et al. 2014). We utilize it for Pseudo Brain Signal Optimization. Here we develop subject-specific models for each of the four subjects in NSD. We present the results of subject 1 in this paper. In fine-tuning, we select 10,000 images from the Open Images dataset in proportion to the original distribution of 600 categories. For inference illustration, the source images are collected from Open Images and free-to-use images from the Bing website, and fMRI from the NSD test set is applied as the condition. In the user study, we randomly pair 100 source images from Open Images with 100 fMRI from NSD as our test benchmark.
Dataset Splits: Yes. In fine-tuning, we select 10,000 images from the Open Images dataset in proportion to the original distribution of 600 categories. For inference illustration, the source images are collected from Open Images and free-to-use images from the Bing website, and fMRI from the NSD test set is applied as the condition. In the user study, we randomly pair 100 source images from Open Images with 100 fMRI from NSD as our test benchmark.
Hardware Specification: No. No specific hardware details (such as GPU models, CPU types, or memory) used for running the experiments are mentioned in the paper.
Software Dependencies: No. The paper mentions using specific models and frameworks, such as 'CLIP ViT-L/14 (Radford et al. 2021)' and a 'diffusion model pre-trained on image-driven editing for initialization (Yang et al. 2023)', but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup: Yes. To ensure steady convergence, we fix the PBG during conditional diffusion fine-tuning. Following the conditional probability optimization of the diffusion model, the training objective can be formulated as:

$$\mathcal{L}^{t}_{\mathrm{cond}} = \mathbb{E}_{x,\;\epsilon \sim \mathcal{N}(0,1),\;t}\Big[\,\big\lVert \epsilon - \epsilon_\theta\big(x_t,\, t,\, A(B, \sigma)\big) \big\rVert_2^2\,\Big] \tag{5}$$

where $t = 1, \dots, T$ and $x_t$ is obtained by corrupting the masked image $x_{t-1}$ with Gaussian noise. $\epsilon_\theta(x_t, t)$ is a set of denoising functions that are usually implemented as U-Nets; see (Rombach et al. 2022) for a more detailed description of Stable Diffusion.

Multi-Mask Generation Policy: In fine-tuning, we introduce the Multi-Mask Generation Policy to increase training complexity, thereby enhancing the model's robustness and generalization. This enables our model to better adapt to different masked regions for image painting. We employ three specific masking strategies: inpaint, outpaint, and random masking. Specifically, based on the bounding box provided by $x_s$, inpaint masking treats the image patch within the bounding box as the condition, while outpaint masking uses the patch outside the bounding box. Drawing on the method of LaMa (Suvorov et al. 2022b), random masking is applied to $x_s$ to obtain masks of random shapes, positions, and quantities. Inspired by (Yang et al. 2023), we apply mask-shape augmentation to all of the aforementioned masks for better adaptation to user-masked painting scenarios. Note that the PBG, which is trained to map the CLIP-embedded image into the brain modality, can handle differently masked image patches. During training, we use probabilistic selection to choose a masking strategy for each training image. The manually set probabilities for the inpaint, outpaint, and random masking strategies are 0.5, 0.3, and 0.2, respectively.
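The objective in Eq. (5) is a standard noise-prediction loss; the sketch below illustrates one Monte Carlo sample of it, with a hypothetical linear stand-in for the U-Net denoiser ε_θ, a generic DDPM noise schedule, and the brain-conditioned input A(B, σ) abstracted as a fixed vector `cond` (all names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # generic DDPM noise schedule
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def eps_theta(x_t, t, cond):
    """Toy stand-in for the denoiser; the real model is a conditioned U-Net."""
    return 0.1 * x_t + 0.01 * cond

def conditional_loss(x0, cond):
    """One Monte Carlo sample of Eq. (5): ||eps - eps_theta(x_t, t, cond)||_2^2."""
    t = rng.integers(0, T)                # t drawn uniformly from {1, ..., T}
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    # Forward corruption: x_t is x_0 blended with Gaussian noise.
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.sum((eps - eps_theta(x_t, t, cond)) ** 2)

x0 = rng.standard_normal(64)    # toy masked-image latent
cond = rng.standard_normal(64)  # toy brain-conditioned embedding A(B, sigma)
loss = conditional_loss(x0, cond)
```

In training this squared error would be averaged over batches and minimized with respect to the denoiser's parameters θ.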
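The three masking strategies and the 0.5/0.3/0.2 probabilistic selection described above can be sketched as follows; a minimal illustration in which the bounding-box handling and LaMa-style random masks are heavily simplified, mask-shape augmentation is omitted, and all helper names are hypothetical:

```python
import numpy as np

# Selection probabilities reported in the paper for the three strategies.
MASK_PROBS = {"inpaint": 0.5, "outpaint": 0.3, "random": 0.2}

def make_mask(strategy, h, w, bbox, rng):
    """Build a binary mask (1 = region to be painted).
    `bbox` is the (top, left, bottom, right) box from the source image x_s."""
    mask = np.zeros((h, w), dtype=np.uint8)
    top, left, bottom, right = bbox
    if strategy == "inpaint":        # paint inside the bounding box
        mask[top:bottom, left:right] = 1
    elif strategy == "outpaint":     # paint everything outside the box
        mask[:] = 1
        mask[top:bottom, left:right] = 0
    else:                            # crude stand-in for LaMa-style random masks
        for _ in range(int(rng.integers(1, 4))):
            y, x = int(rng.integers(0, h // 2)), int(rng.integers(0, w // 2))
            mask[y:y + int(rng.integers(4, h // 2)),
                 x:x + int(rng.integers(4, w // 2))] = 1
    return mask

def sample_training_mask(h, w, bbox, rng):
    """Pick one strategy per training image with probabilities 0.5/0.3/0.2."""
    strategy = str(rng.choice(list(MASK_PROBS), p=list(MASK_PROBS.values())))
    return strategy, make_mask(strategy, h, w, bbox, rng)

rng = np.random.default_rng(0)
strategy, mask = sample_training_mask(64, 64, (16, 16, 48, 48), rng)
```

Drawing the strategy independently per image (rather than fixing one per epoch) is what exposes the model to all three painting regimes during a single fine-tuning run.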