CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Authors: Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification under both zero-shot and few-shot settings. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
Researcher Affiliation | Academia | Byeongchan Lee (EMAIL), Korea Advanced Institute of Science and Technology (KAIST); John Won (EMAIL), KAIST; Seunghyun Lee (EMAIL), KAIST; Jinwoo Shin (EMAIL), KAIST
Pseudocode | Yes | Algorithm 1: CLIPFUSION for Anomaly Segmentation; Algorithm 2: CLIPFUSION for Anomaly Classification
Open Source Code | No | The paper does not explicitly state a code release or link to a code repository for the described method. It mentions third-party libraries such as OpenCLIP and the diffusers library, but no first-party implementation.
Open Datasets | Yes | We use the MVTec-AD (Bergmann et al., 2019) and VisA (Zou et al., 2022) datasets, which are mainly used as benchmark datasets in anomaly detection.
Dataset Splits | Yes | Each dataset consists of several object categories, and each object category is divided into a training set containing only normal images and a test set containing a mixture of normal and abnormal images. Query images are from the test set, and reference images are sampled from the training set. For the k-shot case, k images are used to construct feature memory banks.
Hardware Specification | Yes | We report the inference latency measured on an NVIDIA A100 GPU (40 GB) in Table 7, with a batch size of 1.
Software Dependencies | Yes | For PatchCLIP, we employ OpenCLIP (Ilharco et al., 2021; Cherti et al., 2023; Schuhmann et al., 2022)... For MapDiff, we utilize the pretrained Stable Diffusion v2 model (Rombach et al., 2022)... For MapDiff, we use bilinear interpolation to resize images to a resolution of 512 before feeding into a customized Stable Diffusion Inpainting pipeline from the diffusers library (von Platen et al., 2022).
Experiment Setup | Yes | We set α = 0.25 in Equation 4 for segmentation and α = 0.75 in Equation 9 for classification to reflect the different importance of the models in each task. For PatchCLIP, we use the same pre-processing pipeline as in WinCLIP (Jeong et al., 2023)... First, bicubic interpolation is used to resize images to a resolution of 240... channel-wise standardization is applied with the precomputed values of (0.48145466, 0.4578275, 0.40821073) for the mean and (0.26862954, 0.26130258, 0.27577711) for the standard deviation. For MapDiff, we use bilinear interpolation to resize images to a resolution of 512... The prompts are derived from those used to train CLIP for the ImageNet dataset... MapDiff uses the prompt "a close-up cropped png photo of a [object] with [state]". Table 5 highlights the impact of matching timestep and block numbers when calculating M_Diff,V (Equation 8).
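The k-shot memory banks mentioned in the Dataset Splits row can be illustrated with a minimal sketch. This is not the paper's implementation: random arrays stand in for CLIP patch embeddings, and nearest-neighbor cosine distance is an illustrative scoring rule, assumed here rather than taken from the paper.

```python
import numpy as np

def build_memory_bank(reference_feats):
    """Stack patch features from k normal reference images into one bank.

    reference_feats: list of (num_patches, dim) arrays, one per reference image.
    """
    return np.concatenate(reference_feats, axis=0)  # (k * num_patches, dim)

def anomaly_scores(query_feats, bank):
    """Score each query patch by cosine distance to its nearest bank patch."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = q @ b.T                 # (num_query_patches, bank_size)
    return 1.0 - sim.max(axis=1)  # low similarity to all normals -> high score

# Toy usage: k = 2 reference images, 4 patches each, 8-dim features.
rng = np.random.default_rng(0)
refs = [rng.standard_normal((4, 8)) for _ in range(2)]
bank = build_memory_bank(refs)                       # shape (8, 8)
scores = anomaly_scores(rng.standard_normal((4, 8)), bank)
```

A patch identical to one already in the bank scores zero, matching the intuition that query patches close to normal reference patches are unlikely to be anomalous.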
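The α weights quoted in the Experiment Setup row can be sketched as a score fusion. The exact Equations 4 and 9 are in the paper; the convex combination below is an assumed reading of how α trades off the two branches, with toy anomaly maps standing in for the models' outputs.

```python
import numpy as np

# Assumed fusion form: score = alpha * CLIP score + (1 - alpha) * diffusion score.
ALPHA_SEG = 0.25  # segmentation (Equation 4): diffusion branch weighted more
ALPHA_CLS = 0.75  # classification (Equation 9): CLIP branch weighted more

def fuse(score_clip, score_diff, alpha):
    """Convex combination of the two branches' anomaly scores."""
    return alpha * np.asarray(score_clip) + (1.0 - alpha) * np.asarray(score_diff)

# Toy per-pixel anomaly maps for segmentation.
m_clip = np.array([[0.1, 0.9], [0.2, 0.8]])
m_diff = np.array([[0.3, 0.7], [0.4, 0.6]])
m_seg = fuse(m_clip, m_diff, ALPHA_SEG)

# Toy image-level scores for classification.
s_cls = fuse(0.8, 0.4, ALPHA_CLS)
```

The same `fuse` helper covers both tasks; only α changes, reflecting the report's note that the two models matter differently for segmentation versus classification.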