Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts
Authors: Mateo Espinosa Zarlenga, Gabriele Dominici, Pietro Barbiero, Zohreh Shams, Mateja Jamnik
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions. |
| Researcher Affiliation | Collaboration | 1University of Cambridge 2Università della Svizzera Italiana 3IBM Research 4Leap Laboratories Inc. |
| Pseudocode | No | The paper describes methods and objectives mathematically and in prose, and includes graphical models (Figure 6, Figure 7) and training objectives. However, it does not contain a clearly labeled pseudocode or algorithm block with structured steps formatted like code. |
| Open Source Code | Yes | Our code and experiment configs can be found at https://github.com/mateoespinosa/cem |
| Open Datasets | Yes | Datasets We study these questions on the following tasks: (1) CUB (Wah et al., 2011), a bird classification task with 200 classes and 112 concepts selected by Koh et al. (2020), (2) AwA2 (Xian et al., 2018), an animal classification task with 50 classes and 85 concepts, (3) CelebA (Liu et al., 2018), a face recognition task with 256 classes and 6 concepts selected by Espinosa Zarlenga et al. (2022), and (4) CIFAR-10 (Krizhevsky et al., 2009), a classification task with 10 classes and with 143 concepts obtained in an unsupervised manner by Oikarinen et al. (2023). |
| Dataset Splits | Yes | AwA2: The train-validation-test data splits are produced via a random 60%-20%-20% split, and samples are randomly cropped and flipped during training as in CUB. CUB: For this task, and its incomplete version, we use the same train-validation-test splits as in (Koh et al., 2020). |
| Hardware Specification | Yes | We executed all experiments on a shared GPU cluster with four Nvidia Titan Xp GPUs and 40 Intel(R) Xeon(R) E5-2630 v4 CPUs (at 2.20GHz) with 125GB of RAM. |
| Software Dependencies | Yes | Our experiments were run on PyTorch 1.11.0 (Paszke et al., 2019) and facilitated by PyTorch Lightning 1.9.5 (Falcon, 2019). For our plots, we used matplotlib 3.5.1 (Hunter, 2007) and the open-sourced distribution of draw.io. |
| Experiment Setup | Yes | During training, we use the standard categorical cross-entropy loss as Ltask. ...we use a batch size of 64 for all CUB-based tasks... we use a batch size of 512 for all other tasks. Similarly, when possible, we fix the initial learning rate lr to values used by previous works and decay it during training by a factor of 10 if the training loss reaches a plateau after 10 epochs. Specifically, we use lr = 0.01 for all tasks except for CelebA, where we use lr = 0.05... we use a weight decay of 0.000004... All models were trained for a total of E epochs, where E = 150 for all datasets except for CIFAR-10, where it is E = 50. We use early stopping by tracking the validation loss and stopping training if an improvement in validation loss has not been seen after (patience) × (val_freq) epochs, where patience = 5 and val_freq, the frequency at which we evaluate our model on the validation set, is val_freq = 5. |
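The optimization schedule quoted in the Experiment Setup row (initial lr decayed 10× on a 10-epoch training-loss plateau, weight decay 0.000004, early stopping after patience × val_freq epochs without validation improvement) can be sketched in plain PyTorch. This is an illustrative reconstruction, not the authors' code: the model, the placeholder losses, and the loop scaffolding are assumptions, while the hyperparameter values are taken from the paper.

```python
import torch

# Illustrative stand-in model; the paper uses per-dataset backbones.
model = torch.nn.Linear(10, 2)

# lr = 0.01 (0.05 for CelebA) and weight decay 0.000004, as reported.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.000004)

# Decay lr by a factor of 10 when the tracked loss plateaus for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10)

# Early stopping: validate every val_freq epochs and stop after `patience`
# consecutive checks without improvement, i.e. patience * val_freq epochs.
patience, val_freq = 5, 5
best_val, bad_checks = float("inf"), 0

for epoch in range(150):  # E = 150 (E = 50 for CIFAR-10)
    train_loss = 1.0  # placeholder: one training epoch would go here
    scheduler.step(train_loss)
    if (epoch + 1) % val_freq == 0:
        val_loss = 1.0  # placeholder: one validation pass would go here
        if val_loss < best_val:
            best_val, bad_checks = val_loss, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
```

With the constant placeholder losses, the plateau scheduler fires and the early-stopping counter reaches its limit, so the loop exits well before the epoch budget, which is the behavior the paper describes.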