Compositional Risk Minimization
Authors: Divyat Mahajan, Mohammad Pezeshki, Charles Arnal, Ioannis Mitliagkas, Kartik Ahuja, Pascal Vincent
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts. |
| Researcher Affiliation | Collaboration | Work done at Meta. Joint last author. ¹Meta FAIR, ²Mila, Université de Montréal. Correspondence to: Divyat Mahajan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Compositional Risk Minimization (CRM). Input: Training set Dtrain = {(x, z)}. Output: Classifier parameters θ̂, Ŵ, B ... Algorithm 2 Compositional Risk Minimization (CRM) for the 2-Attribute Case. Input: training set Dtrain with examples (x, y, a), where y is the class to predict and a is an attribute spuriously correlated with y. Output: classifier parameters θ, W, B. |
| Open Source Code | Yes | A practical method: CRM is a simple algorithm for training classifiers, which first trains an additive energy classifier and then adjusts the trained classifier for tackling compositional shifts. We empirically validate the superiority of CRM over other methods previously proposed for addressing subpopulation shifts. Our code repository can be accessed via the link in footnote 1. Footnote 1: GitHub: facebookresearch/compositional-risk-minimization |
| Open Datasets | Yes | Following this procedure, we adapted Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MetaShift (Liang & Zou, 2022), MultiNLI (Williams et al., 2017), and CivilComments (Borkan et al., 2019) for experiments. We also experiment with the NICO++ dataset (Zhang et al., 2023), where we already have Ztrain ⊂ Ztest = Z as some groups were not present in the train dataset. |
| Dataset Splits | Yes | We repurpose these benchmarks for compositional shifts by discarding samples from one of the groups (z) in the train (and validation) dataset; but we don't change the test dataset, i.e., z ∉ Ztrain but z ∈ Ztest. Let us denote the data splits from the standard benchmarks as (Dtrain, Dval, Dtest). Then we generate multiple variants of compositional shifts {(D^{-z}_train, D^{-z}_val, Dtest) \| z ∈ Z}, where D^{-z}_train and D^{-z}_val are generated by discarding samples from Dtrain and Dval that belong to the group z. Table 3. Statistics for the different benchmarks used in our experiments. (Contains columns for Train Size, Val Size, Test Size) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instance specifications. It mentions using ResNet50 and BERT as backbones but not the underlying hardware. |
| Software Dependencies | No | The paper mentions Python-based tools such as PyTorch (implied by a PyTorch implementation snippet; Paszke et al., 2017) and the AdamW optimizer, but it does not specify version numbers for Python, PyTorch, or other libraries. For example, it lists 'import torch' and 'import torchvision' but without versions. |
| Experiment Setup | Yes | Hyperparameter Selection. We rely on the group balanced accuracy on the validation set to determine the optimal hyperparameters. We specify the grids for each hyperparameter in Table 4, and train each method with 5 randomly drawn hyperparameters. The grid sizes for hyperparameter selection were designed following Pezeshki et al. (2023). Table 4. Details about the grids for hyperparameter selection. The choices for grid sizes were taken from Pezeshki et al. (2023). (Contains columns for Learning Rate, Weight Decay, Batch Size, Total Epochs with specific uniform ranges). |
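The split-construction procedure quoted in the Dataset Splits row (drop one group z from train and validation, leave test untouched, one variant per group) can be sketched in a few lines. The function names and the `"z"` group key below are illustrative assumptions, not taken from the paper's released code.

```python
# Sketch of the compositional-shift split construction described in the
# Dataset Splits row. Each example is a dict carrying its group label "z";
# these names are hypothetical, for illustration only.
def make_shift_splits(train, val, test, held_out_z):
    """Discard one group z from train/val; keep the test split unchanged."""
    train_z = [ex for ex in train if ex["z"] != held_out_z]
    val_z = [ex for ex in val if ex["z"] != held_out_z]
    return train_z, val_z, test

def all_variants(train, val, test, groups):
    """One compositional-shift variant per group z in Z."""
    return {z: make_shift_splits(train, val, test, z) for z in groups}

# Tiny demo on toy data.
toy_train = [{"x": 0, "z": "g0"}, {"x": 1, "z": "g1"}]
toy_val = [{"x": 2, "z": "g1"}]
toy_test = [{"x": 3, "z": "g0"}, {"x": 4, "z": "g1"}]
variants = all_variants(toy_train, toy_val, toy_test, ["g0", "g1"])
```

In each variant the held-out group is absent from train and validation but still present in test, which is exactly the z ∉ Ztrain, z ∈ Ztest shift the report quotes.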
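The hyperparameter-selection setup quoted in the Experiment Setup row (5 randomly drawn configurations per method, drawn from per-parameter grids) can be sketched as follows. The specific ranges and the log-uniform sampling choice are placeholder assumptions; the actual grids are in the paper's Table 4.

```python
import math
import random

# Placeholder grids standing in for the paper's Table 4 (assumed values).
GRID = {
    "lr": (1e-5, 1e-3),              # continuous range, sampled log-uniformly
    "weight_decay": (1e-6, 1e-2),    # continuous range, sampled log-uniformly
    "batch_size": [16, 32, 64, 128], # discrete choices, sampled uniformly
    "epochs": [5, 10, 20],           # discrete choices, sampled uniformly
}

def draw_config(grid, rng):
    """Draw one random hyperparameter configuration from the grid."""
    cfg = {}
    for name, space in grid.items():
        if isinstance(space, tuple):  # continuous range -> log-uniform draw
            lo, hi = space
            cfg[name] = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
        else:                          # discrete choices -> uniform pick
            cfg[name] = rng.choice(space)
    return cfg

rng = random.Random(0)
configs = [draw_config(GRID, rng) for _ in range(5)]  # 5 draws per method
```

Each drawn configuration would then be used to train one run of a method, with the group-balanced validation accuracy deciding which of the 5 runs is kept.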