Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization
Authors: Emiliano Penaloza, Tianyue H. Zhang, Laurent Charlin, Mateo Espinosa Zarlenga
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically confirm our analysis finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise. We make our code available on GitHub. ... 5. Experiments: Here, we validate the LCPO objective in three different settings. First, we study LCPO in clean, optimal data, then under concept label noise, and finally in a streaming data context where we leverage a prior when computing our updates. |
| Researcher Affiliation | Academia | 1Université de Montréal 2Mila - Québec AI Institute 3HEC Montréal 4University of Cambridge. Correspondence to: Emiliano Penaloza <EMAIL>. |
| Pseudocode | No | The paper includes mathematical formulations and derivations but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code available on GitHub: https://github.com/Emilianopp/ConceptPreferenceOptimization |
| Open Datasets | Yes | Datasets We study our proposed objective on three real-world image datasets: Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011), Large-scale CelebFaces Attributes (CelebA) (Liu et al., 2015), and Animals with Attributes 2 (AwA2) (Xian et al., 2019). |
| Dataset Splits | Yes | For CUB, ... split into a standard 70%-10%-20% train-validation-test split. For AwA2, ... use the standard 70%-10%-20% train-validation-test split. Finally, for CelebA, ... we use the same 70%-10%-20% train-validation-test split. |
| Hardware Specification | Yes | We train all models using RTX8000 Nvidia-GPU. ... Here, we provide a quantitative analysis of the additional compute it requires. Table 4 shows the average training time per epoch using a single RTX-4800 GPU (the same setup used for all reported experiments). |
| Software Dependencies | No | The paper mentions using a ResNet34 (He et al., 2015) backbone, ImageNet-1k, and PyTorch's built-in implementation of the BCE loss (L_BCE), but does not provide specific version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | We employ a ResNet34 (He et al., 2015) as the backbone image encoder kθ, pretrained on ImageNet-1k (Russakovsky et al., 2015). ... We use a batch size of 512 for the CelebA dataset and 256 for CUB and AwA2. ... In all datasets we train for up to 200 epochs and early stop if the validation loss has not improved in 15 epochs. For fair evaluation across methods, we tune the learning rate for CEMs, CBMs, and ProbCBM. Specifically, for CUB and AwA2 datasets, we explore learning rates {0.1, 0.01}, while for CelebA, we expand the search to {0.1, 0.01, 0.05, 0.005}. Additionally, we set the hyper-parameter λ ∈ {1, 5, 10} for all methods. ... We note that while LCPO introduces a new parameter β, we choose not to tune it and set β = 1 for all experiments. |
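The quoted experiment setup describes a small per-dataset hyperparameter sweep (two learning rates for CUB and AwA2, four for CelebA, crossed with three values of λ, with β fixed at 1). As a minimal sketch of how that grid could be enumerated, assuming the grid values quoted above (the function and variable names here are our own illustration, not from the paper's code):

```python
from itertools import product

# Learning-rate grids per dataset, as quoted from the paper's setup.
LR_GRID = {
    "CUB": [0.1, 0.01],
    "AwA2": [0.1, 0.01],
    "CelebA": [0.1, 0.01, 0.05, 0.005],
}
LAMBDA_GRID = [1, 5, 10]  # λ is swept for all methods
BETA = 1                  # β is not tuned; fixed at 1 in all experiments


def sweep_configs(dataset):
    """Enumerate (learning rate, lambda, beta) configurations for one dataset."""
    return [
        {"lr": lr, "lambda": lam, "beta": BETA}
        for lr, lam in product(LR_GRID[dataset], LAMBDA_GRID)
    ]


print(len(sweep_configs("CUB")))     # 2 learning rates x 3 lambdas = 6
print(len(sweep_configs("CelebA")))  # 4 learning rates x 3 lambdas = 12
```

This yields 6 configurations each for CUB and AwA2 and 12 for CelebA; since β is held at 1, it adds no multiplicative factor to the sweep.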