CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection
Authors: Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification under both zero-shot and few-shot settings. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications. |
| Researcher Affiliation | Academia | Byeongchan Lee (EMAIL), John Won (EMAIL), Seunghyun Lee (EMAIL), Jinwoo Shin (EMAIL); all authors are affiliated with the Korea Advanced Institute of Science and Technology (KAIST). |
| Pseudocode | Yes | Algorithm 1 CLIPFUSION for Anomaly Segmentation Algorithm 2 CLIPFUSION for Anomaly Classification |
| Open Source Code | No | The paper does not provide an explicit statement of code release or a direct link to a code repository for the described methodology. It mentions using third-party libraries such as OpenCLIP and the diffusers library, but not its own implementation. |
| Open Datasets | Yes | We use the MVTec-AD (Bergmann et al., 2019) and VisA (Zou et al., 2022) datasets, which are mainly used as benchmark datasets in anomaly detection. |
| Dataset Splits | Yes | Each dataset consists of several object categories, and each object category is divided into a training set containing only normal images and a test set containing a mixture of normal and abnormal images. Query images are from the test set, and reference images are sampled from the training set. For the k-shot case, k images are used to construct feature memory banks. |
| Hardware Specification | Yes | We report the inference latency measured on an NVIDIA A100 GPU (40 GB) in Table 7, with a batch size of 1. |
| Software Dependencies | Yes | For PatchCLIP, we employ OpenCLIP (Ilharco et al., 2021; Cherti et al., 2023; Schuhmann et al., 2022)... For MapDiff, we utilize the pretrained Stable Diffusion v2 model (Rombach et al., 2022)... For MapDiff, we use bilinear interpolation to resize images to a resolution of 512 before feeding them into a customized Stable Diffusion Inpainting pipeline from the diffusers library (von Platen et al., 2022). |
| Experiment Setup | Yes | We set α = 0.25 in Equation 4 for segmentation and α = 0.75 in Equation 9 for classification to reflect the different importance of the models in each task. For PatchCLIP, we use the same pre-processing pipeline as in WinCLIP (Jeong et al., 2023)... First, bicubic interpolation is used to resize images to a resolution of 240... channel-wise standardization is applied with the precomputed values of (0.48145466, 0.4578275, 0.40821073) for the mean and (0.26862954, 0.26130258, 0.27577711) for the standard deviation. For MapDiff, we use bilinear interpolation to resize images to a resolution of 512... The prompts are derived from those used to train CLIP for the ImageNet dataset... MapDiff uses the prompt "a close-up cropped png photo of a [object] with [state]"... Table 5 highlights the impact of matching timestep and block numbers when calculating M_{Diff,V} (Equation 8). |
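The Experiment Setup row quotes two concrete ingredients that can be sketched in code: the channel-wise standardization with the precomputed CLIP statistics, and the α-weighted fusion of the two models' anomaly scores. The sketch below is a minimal, hedged illustration: the standardization constants and the α values (0.25 for segmentation, 0.75 for classification) are taken from the report, but the convex-combination form of the fusion and the function names are assumptions for illustration, not the paper's exact Equations 4 and 9.

```python
# Precomputed CLIP normalization statistics quoted in the Experiment Setup row.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def standardize_pixel(rgb):
    """Channel-wise standardization of one RGB pixel (values in [0, 1]),
    applied after the bicubic resize to resolution 240 described in the setup."""
    return tuple((v - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

def fuse_scores(m_clip, m_diff, alpha):
    """Assumed convex combination of the CLIP-based and diffusion-based
    anomaly scores. The report quotes alpha = 0.25 for segmentation
    (Equation 4) and alpha = 0.75 for classification (Equation 9); the
    exact fusion form in the paper may differ."""
    return alpha * m_clip + (1.0 - alpha) * m_diff
```

A pixel exactly at the precomputed mean standardizes to (0, 0, 0), and with α = 0.25 the diffusion score dominates the segmentation map, matching the report's note that the weights reflect each model's importance per task.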