Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Zero-Shot Natural Language Explanations
Authors: Fawaz Sammani, Nikos Deligiannis
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on 38 vision models, including both CNNs and Transformers. Our method outperforms supervised baselines on many metrics, while remaining comparable on others. |
| Researcher Affiliation | Collaboration | Fawaz Sammani & Nikos Deligiannis; ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium; imec, Kapeldreef 75, B-3001 Leuven, Belgium |
| Pseudocode | No | The paper describes the methods in prose and with diagrams (Figure 3, Figure 4), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code for the described methodology, nor does it provide a direct link to a code repository. It mentions using third-party libraries and models. |
| Open Datasets | Yes | We use the challenging ImageNet-1K dataset as our benchmark... ImageNet-X (Sammani & Deligiannis, 2023)... COCO dataset (Lin et al., 2014)... Places365 dataset (Zhou et al., 2017)... DTD dataset (Cimpoi et al., 2014) |
| Dataset Splits | Yes | We use the challenging ImageNet-1K dataset as our benchmark, splitting its 1,000 classes into 900 for training and 100 for testing. For validation and hyperparameter tuning, we use 100 non-overlapping classes from the ImageNet-21K dataset. ... ImageNet-X (Sammani & Deligiannis, 2023) dataset ... It consists of 141K training samples, 2K for validation and 1K for testing. ... We report results on the common Karpathy test split benchmark using various vision classifiers. |
| Hardware Specification | Yes | On a single RTX3090 GPU, it takes roughly 10 seconds to train. |
| Software Dependencies | No | The paper mentions software components like 'Stable Diffusion v1.5 model', 'Hugging Face Diffusers library', 'smallest GPT-2 (Radford et al., 2019) model', 'Adam Optimizer (Kingma & Ba, 2015)', 'Sentence Transformers library', 'torchvision library', 'timm library', and 'huggingface library'. However, it does not provide specific version numbers for general ancillary software such as Python, PyTorch, or CUDA libraries. |
| Experiment Setup | Yes | The MLP is trained with full batch gradient descent for 2500 epochs using the Adam Optimizer (Kingma & Ba, 2015) with a learning rate of 5e-3 and a cosine annealing schedule (Loshchilov & Hutter, 2017). ... The number of learnable prefixes is set to 5 in each attention block of the GPT-2 model. ... We train the prefixes with the Adam optimizer ... with a learning rate of 0.01 and a weight decay of 0.3 with a cosine annealing schedule for I = 20 iterations. ... The number of K tokens sampled at each timestep is set to 512. We use a maximum NLE length of 20. ... We add the fluency loss from Tewel et al. (2021) with a weight of 0.8. ... We also prevent the generation of repeated n-grams of order 3 to avoid repetitive phrases, by setting their score to negative infinity... We also enforce a minimum sequence length of 10 tokens by setting the <.> token score to 0, in order to prevent premature termination. |
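Two of the settings quoted in the Experiment Setup row can be made concrete with a short sketch: the cosine annealing learning-rate schedule (lr 5e-3 over 2500 epochs) and the blocking of repeated n-grams of order 3 by setting their score to negative infinity. The function names and toy usage below are ours, illustrative of the quoted settings rather than the authors' code:

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate (Loshchilov & Hutter, 2017)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def block_repeated_ngrams(generated, logits, n=3):
    """Forbid any token that would repeat an n-gram already present in
    `generated` by setting its score to -inf (the quoted setup uses n = 3)."""
    if len(generated) >= n - 1:
        prefix = tuple(generated[-(n - 1):])
        for i in range(len(generated) - n + 1):
            if tuple(generated[i:i + n - 1]) == prefix:
                logits[generated[i + n - 1]] = float("-inf")
    return logits

# Quoted MLP schedule: lr 5e-3 decayed over 2500 epochs.
lrs = [cosine_annealing_lr(t, 2500, 5e-3) for t in range(2501)]
```

The schedule starts at the full learning rate and decays smoothly to zero; the n-gram blocker is applied to the token scores at each decoding step before sampling.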