PopulAtion Parameter Averaging (PAPA)

Authors: Alexia Jolicoeur-Martineau, Emy Gervais, Kilian Fatras, Yan Zhang, Simon Lacoste-Julien

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate in Section 4 that PAPA and its variants lead to substantial performance gains when training small network populations (2-10 networks) from scratch with low compute (1 GPU). Our method increases the average accuracy of the population by up to 0.8% on CIFAR-10 (5-10 networks), 1.9% on CIFAR-100 (5-10 networks), and 1.6% on ImageNet (2-3 networks)."
Researcher Affiliation | Collaboration | Alexia Jolicoeur-Martineau (Samsung SAIT AI Lab, Montreal); Emy Gervais (Independent); Kilian Fatras (Mila, McGill University); Yan Zhang (Samsung SAIT AI Lab, Montreal); Simon Lacoste-Julien (Mila, University of Montreal; Samsung SAIT AI Lab, Montreal; Canada CIFAR AI Chair)
Pseudocode | Yes | "Figure 1 shows an illustration of PAPA and Algorithm 1 provides the full description of PAPA and its variants (PAPA-all and PAPA-2)."
Open Source Code | No | The paper does not provide a link to source code or an affirmative statement of code release. The abstract states, "Our code will be released upon publication", which indicates future availability, not current access.
Open Datasets | Yes | "For image classification, we train models from scratch on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009); we also fine-tune pre-trained models on CIFAR-100. For image segmentation, we train models from scratch on ISPRS Vaihingen (Rottensteiner et al., 2012)."
Dataset Splits | Yes | "For image classification, we only have access to train and test data; thereby, we remove 2% of the training data to use as evaluation data for the greedy soups. For the Vaihingen dataset (Rottensteiner et al., 2012), we follow the training procedure and PyTorch implementation from (Audebert et al., 2017). We use a UNet (Ronneberger et al., 2015) and the train, validation, and test splits from (Fatras et al., 2021). We use 11 tiles for training, 5 tiles for validation, and the remaining 17 tiles for testing our model."
Hardware Specification | Yes | "For all experiments, we use a single GPU: A100 40Gb (for ImageNet) or V100 16Gb (for all other experiments)."
Software Dependencies | No | The paper mentions software such as PyTorch and optimizers such as SGD, Adam, and AdamW, but does not provide version numbers for these dependencies, which a fully reproducible description requires.
Experiment Setup | Yes | "For training-from-scratch on CIFAR-10 and CIFAR-100, training is done over 300 epochs with a cosine learning rate (1e-1 to 1e-4) (Loshchilov and Hutter, 2016) using SGD with a weight decay of 1e-4. Batch size is 64 and REPAIR uses 5 forward-passes. For training-from-scratch on ImageNet, training is done over 90 epochs with a cosine learning rate (1e-1 to 1e-4) (Loshchilov and Hutter, 2016) using SGD with a weight decay of 1e-4. Batch size is 256 and REPAIR is not used."
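The population-averaging idea behind PAPA can be sketched in plain Python, with toy weight vectors standing in for network parameters. This is an illustrative reimplementation, not the authors' Algorithm 1: the helper names and the push rate `alpha` are assumptions, not values from the paper.

```python
# Toy sketch of PAPA-style population averaging.
# Each "network" is a plain list of floats; `alpha` is a hypothetical
# push rate, not a value taken from the paper.

def population_mean(population):
    """Element-wise mean of a list of weight vectors."""
    n = len(population)
    return [sum(w[i] for w in population) / n
            for i in range(len(population[0]))]

def papa_push(population, alpha=0.1):
    """PAPA: nudge every member slightly toward the population mean."""
    mean = population_mean(population)
    return [[(1 - alpha) * wi + alpha * mi for wi, mi in zip(w, mean)]
            for w in population]

def papa_all(population):
    """PAPA-all: replace every member with the full population average."""
    mean = population_mean(population)
    return [list(mean) for _ in population]
```

This sketch shows only the averaging arithmetic, not the training loop; in the paper, the population members keep training between averaging operations so they do not collapse onto one another.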
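The 2% hold-out for greedy soups mentioned in the Dataset Splits row can be sketched as follows; this is a minimal sketch, and the function name, the seed, and the use of random shuffling are assumptions, not details from the paper.

```python
import random

def holdout_split(indices, frac=0.02, seed=0):
    """Set aside `frac` of the training indices as evaluation data
    (e.g. for selecting greedy-soup members); the rest remain for
    training. `seed` is an illustrative choice, not from the paper."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * frac))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```

For CIFAR-10's 50,000 training images, a 2% hold-out yields 1,000 evaluation examples and leaves 49,000 for training.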
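The cosine learning-rate schedule (1e-1 to 1e-4) quoted in the Experiment Setup row corresponds to the standard cosine-annealing formula of Loshchilov and Hutter (2016). A minimal sketch, assuming per-epoch granularity (the paper only states the range and epoch counts):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=1e-1, lr_min=1e-4):
    """Cosine annealing from lr_max down to lr_min over total_epochs
    (300 for CIFAR, 90 for ImageNet in the paper). Per-epoch rather
    than per-step decay is an assumption."""
    t = epoch / total_epochs  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The schedule starts at `lr_max` (epoch 0), reaches roughly the midpoint of the range halfway through, and ends at `lr_min`; PyTorch's `CosineAnnealingLR` implements the same formula.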