Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization
Authors: Emiliano Penaloza, Tianyue H. Zhang, Laurent Charlin, Mateo Espinosa Zarlenga
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically confirm our analysis finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise. We make our code available on GitHub. ... 5. Experiments: Here, we validate the LCPO objective in three different settings. First, we study LCPO in clean, optimal data, then under concept label noise, and finally in a streaming data context where we leverage a prior when computing our updates. |
| Researcher Affiliation | Academia | 1Université de Montréal 2Mila - Québec AI Institute 3HEC Montréal 4University of Cambridge. Correspondence to: Emiliano Penaloza <EMAIL>. |
| Pseudocode | No | The paper includes mathematical formulations and derivations but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code available on GitHub: https://github.com/Emilianopp/ConceptPreferenceOptimization |
| Open Datasets | Yes | Datasets We study our proposed objective on three real-world image datasets: Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011), Large-scale CelebFaces Attributes (CelebA) (Liu et al., 2015), and Animals with Attributes 2 (AwA2) (Xian et al., 2019). |
| Dataset Splits | Yes | For CUB, ... split into a standard 70%-10%-20% train-validation-test split. For AwA2, ... use the standard 70%-10%-20% train-validation-test split. Finally, for CelebA, ... we use the same 70%-10%-20% train-validation-test split. |
| Hardware Specification | Yes | We train all models using RTX8000 Nvidia-GPU. ... Here, we provide a quantitative analysis of the additional compute it requires. Table 4 shows the average training time per epoch using a single RTX-4800 GPU (the same setup used for all reported experiments). |
| Software Dependencies | No | The paper mentions using a ResNet34 (He et al., 2015) backbone, ImageNet-1k, and PyTorch's built-in implementation of the BCE loss (L_BCE), but does not provide specific version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | We employ a ResNet34 (He et al., 2015) as the backbone image encoder kθ, pretrained on ImageNet-1k (Russakovsky et al., 2015). ... We use a batch size of 512 for the CelebA dataset and 256 for CUB and AwA2. ... In all datasets we train for up to 200 epochs and early stop if the validation loss has not improved in 15 epochs. For fair evaluation across methods, we tune the learning rate for CEMs, CBMs, and ProbCBM. Specifically, for CUB and AwA2 datasets, we explore learning rates {0.1, 0.01}, while for CelebA, we expand the search to {0.1, 0.01, 0.05, 0.005}. Additionally, we set the hyper-parameter λ ∈ {1, 5, 10} for all methods. ... We note that while LCPO introduces a new parameter β, we choose not to tune it and set β = 1 for all experiments. |
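The quoted experiment setup describes a small per-dataset hyperparameter sweep (two learning rates for CUB and AwA2, four for CelebA, crossed with three values of λ, with β fixed at 1). As a minimal sketch of how that grid could be enumerated, assuming the grid values quoted above (the function and variable names here are our own illustration, not from the paper's code):

```python
from itertools import product

# Learning-rate grids per dataset, as quoted from the paper's setup.
LR_GRID = {
    "CUB": [0.1, 0.01],
    "AwA2": [0.1, 0.01],
    "CelebA": [0.1, 0.01, 0.05, 0.005],
}
LAMBDA_GRID = [1, 5, 10]  # λ is swept for all methods
BETA = 1                  # β is not tuned; fixed at 1 in all experiments


def sweep_configs(dataset):
    """Enumerate (learning rate, lambda, beta) configurations for one dataset."""
    return [
        {"lr": lr, "lambda": lam, "beta": BETA}
        for lr, lam in product(LR_GRID[dataset], LAMBDA_GRID)
    ]


print(len(sweep_configs("CUB")))     # 2 learning rates x 3 lambdas = 6
print(len(sweep_configs("CelebA")))  # 4 learning rates x 3 lambdas = 12
```

This yields 6 configurations each for CUB and AwA2 and 12 for CelebA; since β is held at 1, it adds no multiplicative factor to the sweep.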