CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Authors: Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification under both zero-shot and few-shot settings. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
Researcher Affiliation | Academia | Byeongchan Lee (EMAIL), Korea Advanced Institute of Science and Technology (KAIST); John Won (EMAIL), KAIST; Seunghyun Lee (EMAIL), KAIST; Jinwoo Shin (EMAIL), KAIST
Pseudocode | Yes | Algorithm 1: CLIPFUSION for Anomaly Segmentation; Algorithm 2: CLIPFUSION for Anomaly Classification
Open Source Code | No | The paper does not explicitly state a code release or link to a code repository for the described method. It mentions third-party libraries such as OpenCLIP and the diffusers library, but no first-party implementation.
Open Datasets | Yes | We use the MVTec-AD (Bergmann et al., 2019) and VisA (Zou et al., 2022) datasets, which are mainly used as benchmark datasets in anomaly detection.
Dataset Splits | Yes | Each dataset consists of several object categories, and each object category is divided into a training set containing only normal images and a test set containing a mixture of normal and abnormal images. Query images are from the test set, and reference images are sampled from the training set. For the k-shot case, k images are used to construct feature memory banks.
Hardware Specification | Yes | We report the inference latency measured on an NVIDIA A100 GPU (40 GB) in Table 7, with a batch size of 1.
Software Dependencies | Yes | For PatchCLIP, we employ OpenCLIP (Ilharco et al., 2021; Cherti et al., 2023; Schuhmann et al., 2022)... For MapDiff, we utilize the pretrained Stable Diffusion v2 model (Rombach et al., 2022)... For MapDiff, we use bilinear interpolation to resize images to a resolution of 512 before feeding into a customized Stable Diffusion Inpainting pipeline from the diffusers library (von Platen et al., 2022).
Experiment Setup | Yes | We set α = 0.25 in Equation 4 for segmentation and α = 0.75 in Equation 9 for classification to reflect the different importance of the models in each task. For PatchCLIP, we use the same pre-processing pipeline as in WinCLIP (Jeong et al., 2023)... First, bicubic interpolation is used to resize images to a resolution of 240... channel-wise standardization is applied with the precomputed values of (0.48145466, 0.4578275, 0.40821073) for the mean and (0.26862954, 0.26130258, 0.27577711) for the standard deviation. For MapDiff, we use bilinear interpolation to resize images to a resolution of 512... The prompts are derived from those used to train CLIP for the ImageNet dataset... MapDiff uses the prompt "a close-up cropped png photo of a [object] with [state]". Table 5 highlights the impact of matching timestep and block numbers when calculating M_Diff,V (Equation 8).
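The k-shot memory banks mentioned in the Dataset Splits row can be illustrated with a minimal sketch. This is not the paper's implementation: random arrays stand in for CLIP patch embeddings, and nearest-neighbor cosine distance is an illustrative scoring rule, assumed here rather than taken from the paper.

```python
import numpy as np

def build_memory_bank(reference_feats):
    """Stack patch features from k normal reference images into one bank.

    reference_feats: list of (num_patches, dim) arrays, one per reference image.
    """
    return np.concatenate(reference_feats, axis=0)  # (k * num_patches, dim)

def anomaly_scores(query_feats, bank):
    """Score each query patch by cosine distance to its nearest bank patch."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = q @ b.T                 # (num_query_patches, bank_size)
    return 1.0 - sim.max(axis=1)  # low similarity to all normals -> high score

# Toy usage: k = 2 reference images, 4 patches each, 8-dim features.
rng = np.random.default_rng(0)
refs = [rng.standard_normal((4, 8)) for _ in range(2)]
bank = build_memory_bank(refs)                       # shape (8, 8)
scores = anomaly_scores(rng.standard_normal((4, 8)), bank)
```

A patch identical to one already in the bank scores zero, matching the intuition that query patches close to normal reference patches are unlikely to be anomalous.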
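The α weights quoted in the Experiment Setup row can be sketched as a score fusion. The exact Equations 4 and 9 are in the paper; the convex combination below is an assumed reading of how α trades off the two branches, with toy anomaly maps standing in for the models' outputs.

```python
import numpy as np

# Assumed fusion form: score = alpha * CLIP score + (1 - alpha) * diffusion score.
ALPHA_SEG = 0.25  # segmentation (Equation 4): diffusion branch weighted more
ALPHA_CLS = 0.75  # classification (Equation 9): CLIP branch weighted more

def fuse(score_clip, score_diff, alpha):
    """Convex combination of the two branches' anomaly scores."""
    return alpha * np.asarray(score_clip) + (1.0 - alpha) * np.asarray(score_diff)

# Toy per-pixel anomaly maps for segmentation.
m_clip = np.array([[0.1, 0.9], [0.2, 0.8]])
m_diff = np.array([[0.3, 0.7], [0.4, 0.6]])
m_seg = fuse(m_clip, m_diff, ALPHA_SEG)

# Toy image-level scores for classification.
s_cls = fuse(0.8, 0.4, ALPHA_CLS)
```

The same `fuse` helper covers both tasks; only α changes, reflecting the report's note that the two models matter differently for segmentation versus classification.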