Reproducibility study of “LICO: Explainable Models with Language-Image Consistency”
Authors: Luan Fletcher, Robert van der Klis, Martin Sedláček, Stefan Vasilev, Christos Athanasiadis
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive reproducibility study, employing (Wide) ResNets and established interpretability methods like Grad-CAM and RISE. We were mostly unable to reproduce the authors' results. In particular, we did not find that LICO consistently led to improved classification performance or to improvements in quantitative and qualitative measures of interpretability. Thus, our findings highlight the importance of rigorous evaluation and transparent reporting in interpretability research. |
| Researcher Affiliation | Academia | Luan Fletcher (EMAIL), Department of Computer Science, University of Amsterdam; Robert van der Klis (EMAIL), Department of Computer Science, University of Amsterdam; Martin Sedláček (EMAIL), Department of Computer Science, University of Amsterdam; Stefan Vasilev (EMAIL), Department of Computer Science, University of Amsterdam; Christos Athanasiadis (EMAIL), Department of Computer Science, University of Amsterdam |
| Pseudocode | No | The paper gives both an intuitive and a formal description of the losses, and pseudocode for the training algorithm, which is very helpful in understanding how the method works and why. (This sentence refers to the original LICO paper, not to pseudocode present in this reproducibility study. No pseudocode block is found in the current paper.) |
| Open Source Code | Yes | Lastly, our code is available on GitHub*. *https://github.com/robertdvdk/lico-fact |
| Open Datasets | Yes | We evaluate LICO's classification performance and measure Insertion/Deletion scores (Petsiuk et al., 2018; Zhang et al., 2021; Wang et al., 2020), as defined in Section 3.4, on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and Imagenette (Howard, 2019) datasets. [...] We also evaluate the LICO model on the PartImageNet dataset (He et al., 2022). |
| Dataset Splits | Yes | For both datasets, we split the dataset into 47,500 training, 2,500 validation and 10,000 test samples. On CIFAR-100, we also create a subset consisting of only 2,500 examples to compare the performance of a baseline model vs. a model trained with LICO when training with limited data. [...] For Imagenette, we assess both classification accuracy and Insertion/Deletion scores (Section 3.4). The dataset is divided into 9,000 training samples, 475 validation samples, and 3,925 test samples. |
| Hardware Specification | Yes | For our runs we used NVIDIA A100 GPUs on a cluster. |
| Software Dependencies | No | The paper makes use of existing implementations of CAM-based interpretation methods (Gildenblat & contributors, 2021) and RISE (Ishikawa, 2019) but does not specify the versions used. Crucially, the paper does not state specific version numbers for key machine learning frameworks like PyTorch or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | All hyperparameters are shown in Table 7. We keep α = 10 and β = 1 for all experiments. The batch size is kept at 64 for all setups other than ImageNet, and the learning rate at 0.03. We use SGD as optimizer with momentum 0.9 and weight decay 0.0001. We apply the cosine learning rate scheduler η = η₀ cos(7πk / (16K)) from the paper (Lei et al., 2023), and we train on CIFAR-10/100 for 200 epochs and on ImageNet and Imagenette for 90 epochs. |
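The cosine learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not code from the study's repository; the function name is ours, and we assume `k` is the current training step and `K` the total number of steps, as in Lei et al. (2023).

```python
import math

def cosine_lr(eta0: float, k: int, K: int) -> float:
    """Cosine schedule eta = eta0 * cos(7*pi*k / (16*K)).

    Decays the learning rate from eta0 at step k = 0 down to
    eta0 * cos(7*pi/16) (about 0.195 * eta0) at step k = K,
    so the rate never reaches zero.
    """
    return eta0 * math.cos(7 * math.pi * k / (16 * K))
```

With the paper's initial learning rate of 0.03, `cosine_lr(0.03, 0, K)` returns 0.03 at the start of training and decays monotonically thereafter.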