Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Authors: Emiliano Penaloza, Tianyue H. Zhang, Laurent Charlin, Mateo Espinosa Zarlenga

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets, with and without added label noise. We make our code available on GitHub." ... "5. Experiments: Here, we validate the L_CPO objective in three different settings. First, we study L_CPO on clean, optimal data, then under concept label noise, and finally in a streaming-data context where we leverage a prior when computing our updates."
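The report does not reproduce the paper's exact L_CPO formulation. Purely as an illustrative assumption, the family of preference objectives CPO builds on can be sketched with a generic Bradley-Terry / DPO-style loss; the function name and scoring convention below are hypothetical, not the authors' definition:

```python
import math

def preference_loss(score_preferred, score_rejected, beta=1.0):
    # Generic Bradley-Terry preference objective (illustrative sketch,
    # NOT the paper's exact L_CPO): -log sigmoid(beta * score margin),
    # where the margin favors the preferred concept labeling.
    margin = beta * (score_preferred - score_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal scores the loss is log 2, and it decays toward 0 as the model scores the preferred labeling increasingly above the rejected one; the paper fixes the temperature at β = 1 rather than tuning it.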
Researcher Affiliation | Academia | 1) Université de Montréal; 2) Mila Québec AI Institute; 3) HEC Montréal; 4) University of Cambridge. Correspondence to: Emiliano Penaloza <EMAIL>.
Pseudocode | No | The paper includes mathematical formulations and derivations but does not present any explicitly labeled pseudocode or algorithm blocks.
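Since no pseudocode is given, here is a minimal sketch of the standard concept-level BCE objective that CPO is compared against. This reflects common CBM practice, not code from the paper, and is written in plain Python to stay self-contained:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def concept_bce(concept_logits, concept_labels, eps=1e-12):
    # Mean binary cross-entropy over k concepts: one sigmoid per
    # concept logit, compared against its binary concept label.
    total = 0.0
    for z, c in zip(concept_logits, concept_labels):
        p = sigmoid(z)
        total += -(c * math.log(p + eps) + (1 - c) * math.log(1 - p + eps))
    return total / len(concept_logits)
```

A zero logit (p = 0.5) yields log 2 per concept, while a confident correct prediction drives the per-concept loss toward 0; this is the baseline objective the paper argues is vulnerable to concept mislabeling.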
Open Source Code | Yes | "We make our code available on GitHub." https://github.com/Emilianopp/ConceptPreferenceOptimization
Open Datasets | Yes | "Datasets: We study our proposed objective on three real-world image datasets: Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011), Large-scale CelebFaces Attributes (CelebA) (Liu et al., 2015), and Animals with Attributes 2 (AwA2) (Xian et al., 2019)."
Dataset Splits | Yes | "For CUB, ... split into a standard 70%-10%-20% train-validation-test split. For AwA2, ... use the standard 70%-10%-20% train-validation-test split. Finally, for CelebA, ... we use the same 70%-10%-20% train-validation-test split."
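The reported 70%-10%-20% partition sizes are straightforward to reproduce; the helper below is an assumption about implementation (the rounding rule is ours), not the authors' code:

```python
def split_sizes(n, fractions=(0.7, 0.1, 0.2)):
    # Integer sizes for a 70/10/20 train-val-test split; any rounding
    # remainder from truncation is assigned to the training partition
    # so the three sizes always sum to n.
    sizes = [int(n * f) for f in fractions]
    sizes[0] += n - sum(sizes)
    return sizes
```

For example, a dataset of 100 images yields 70/10/20 exactly, while sizes that do not divide cleanly push the leftover examples into the training set.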
Hardware Specification | Yes | "We train all models using an RTX 8000 Nvidia GPU." ... "Here, we provide a quantitative analysis of the additional compute it requires. Table 4 shows the average training time per epoch using a single RTX-4800 GPU (the same setup used for all reported experiments)."
Software Dependencies | No | The paper mentions using a ResNet34 (He et al., 2015) backbone, ImageNet-1k, and PyTorch's built-in implementation of L_BCE, but does not provide specific version numbers for any software libraries or frameworks.
Experiment Setup | Yes | "We employ a ResNet34 (He et al., 2015) as the backbone image encoder kθ, pretrained on ImageNet-1k (Russakovsky et al., 2015). ... We use a batch size of 512 for the CelebA dataset and 256 for CUB and AwA2. ... In all datasets we train for up to 200 epochs and early stop if the validation loss has not improved in 15 epochs. For fair evaluation across methods, we tune the learning rate for CEMs, CBMs, and ProbCBM. Specifically, for the CUB and AwA2 datasets, we explore learning rates {0.1, 0.01}, while for CelebA, we expand the search to {0.1, 0.01, 0.05, 0.005}. Additionally, we set the hyper-parameter λ ∈ {1, 5, 10} for all methods. ... We note that while L_CPO introduces a new parameter β, we choose not to tune it and set β = 1 for all experiments."
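The reported search procedure (per-dataset learning-rate grids, a shared λ grid, and early stopping after 15 stagnant epochs) can be sketched as follows. The names and structure are assumptions for illustration, not the authors' code:

```python
from itertools import product

# Hyper-parameter grids as reported in the paper.
LEARNING_RATES = {
    "CUB": [0.1, 0.01],
    "AwA2": [0.1, 0.01],
    "CelebA": [0.1, 0.01, 0.05, 0.005],
}
LAMBDAS = [1, 5, 10]

def hyperparameter_grid(dataset):
    # All (learning rate, lambda) combinations explored for one dataset.
    return list(product(LEARNING_RATES[dataset], LAMBDAS))

class EarlyStopper:
    # Stop training once validation loss has not improved
    # for `patience` consecutive epochs (15 in the paper).
    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

This gives 6 configurations per method for CUB and AwA2 and 12 for CelebA; the cap of 200 epochs from the paper would wrap the early-stopping check in an outer training loop.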