Reproducibility study of “LICO: Explainable Models with Language-Image Consistency”
Authors: Luan Fletcher, Robert van der Klis, Martin Sedláček, Stefan Vasilev, Christos Athanasiadis
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive reproducibility study, employing (Wide) ResNets and established interpretability methods like Grad-CAM and RISE. We were mostly unable to reproduce the authors' results. In particular, we did not find that LICO consistently led to improved classification performance or to improvements in quantitative and qualitative measures of interpretability. Thus, our findings highlight the importance of rigorous evaluation and transparent reporting in interpretability research. |
| Researcher Affiliation | Academia | Luan Fletcher (EMAIL), Department of Computer Science, University of Amsterdam; Robert van der Klis (EMAIL), Department of Computer Science, University of Amsterdam; Martin Sedláček (EMAIL), Department of Computer Science, University of Amsterdam; Stefan Vasilev (EMAIL), Department of Computer Science, University of Amsterdam; Christos Athanasiadis (EMAIL), Department of Computer Science, University of Amsterdam |
| Pseudocode | No | The paper gives both an intuitive and a formal description of the losses, and pseudocode for the training algorithm, which is very helpful in understanding how the method works and why. (This sentence refers to the original LICO paper, not to pseudocode present in this reproducibility study. No pseudocode block is found in the current paper.) |
| Open Source Code | Yes | Lastly, our code is available on GitHub*. *https://github.com/robertdvdk/lico-fact |
| Open Datasets | Yes | We evaluate LICO's classification performance and measure Insertion/Deletion scores (Petsiuk et al., 2018; Zhang et al., 2021; Wang et al., 2020), as defined in Section 3.4, on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and Imagenette (Howard, 2019) datasets. [...] We also evaluate the LICO model on the PartImageNet dataset (He et al., 2022). |
| Dataset Splits | Yes | For both datasets, we split the dataset into 47,500 training, 2,500 validation and 10,000 test samples. On CIFAR-100, we also create a subset consisting of only 2,500 examples to compare the performance of a baseline model vs. a model trained with LICO when training with limited data. [...] For Imagenette, we assess both classification accuracy and Insertion/Deletion scores (Section 3.4). The dataset is divided into 9,000 training samples, 475 validation samples, and 3,925 test samples. |
| Hardware Specification | Yes | For our runs we used NVIDIA A100 GPUs on a cluster. |
| Software Dependencies | No | The paper makes use of existing implementations of CAM-based interpretation methods (Gildenblat & contributors, 2021) and RISE (Ishikawa, 2019) but does not specify the versions used. Crucially, the paper does not state specific version numbers for key machine learning frameworks like PyTorch or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | All hyperparameters are shown in Table 7. We keep α = 10 and β = 1 for all experiments. The batch size is kept at 64 for all setups other than ImageNet, and the learning rate at 0.03. We use SGD as optimizer with momentum 0.9 and weight decay 0.0001. We apply the cosine learning rate scheduler η = η₀ cos(7πk / (16K)) from the paper (Lei et al., 2023), and we train on CIFAR-10/100 for 200 epochs and on ImageNet and Imagenette for 90 epochs. |
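The cosine learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not code from the study's repository; the function name is ours, and we assume `k` is the current training step and `K` the total number of steps, as in Lei et al. (2023).

```python
import math

def cosine_lr(eta0: float, k: int, K: int) -> float:
    """Cosine schedule eta = eta0 * cos(7*pi*k / (16*K)).

    Decays the learning rate from eta0 at step k = 0 down to
    eta0 * cos(7*pi/16) (about 0.195 * eta0) at step k = K,
    so the rate never reaches zero.
    """
    return eta0 * math.cos(7 * math.pi * k / (16 * K))
```

With the paper's initial learning rate of 0.03, `cosine_lr(0.03, 0, K)` returns 0.03 at the start of training and decays monotonically thereafter.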