PopulAtion Parameter Averaging (PAPA)
Authors: Alexia Jolicoeur-Martineau, Emy Gervais, Kilian Fatras, Yan Zhang, Simon Lacoste-Julien
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate in Section 4 that PAPA and its variants lead to substantial performance gains when training small network populations (2-10 networks) from scratch with low compute (1 GPU). Our method increases the average accuracy of the population by up to 0.8% on CIFAR-10 (5-10 networks), 1.9% on CIFAR-100 (5-10 networks), and 1.6% on ImageNet (2-3 networks). |
| Researcher Affiliation | Collaboration | Alexia Jolicoeur-Martineau (Samsung SAIT AI Lab, Montreal); Emy Gervais (Independent); Kilian Fatras (Mila, McGill University); Yan Zhang (Samsung SAIT AI Lab, Montreal); Simon Lacoste-Julien (Mila, University of Montreal; Samsung SAIT AI Lab, Montreal; Canada CIFAR AI Chair) |
| Pseudocode | Yes | Figure 1 shows an illustration of PAPA and Algorithm 1 provides the full description of PAPA and its variants (PAPA-all and PAPA-2). |
| Open Source Code | No | The paper does not explicitly provide a link to source code or an affirmative statement of code release. It states, "Our code will be released upon publication" in the abstract, which indicates future availability, not current access. |
| Open Datasets | Yes | For image classification, we train models from scratch on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009); we also fine-tune pre-trained models on CIFAR-100. For image segmentation, we train models from scratch on ISPRS Vaihingen (Rottensteiner et al., 2012). |
| Dataset Splits | Yes | For image classification, we only have access to train and test data; therefore, we remove 2% of the training data to use as evaluation data for the greedy soups. For the Vaihingen dataset (Rottensteiner et al., 2012), we follow the training procedure and PyTorch implementation from (Audebert et al., 2017). We use a UNet (Ronneberger et al., 2015) and the train, validation, and test splits from (Fatras et al., 2021). We use 11 tiles for training, 5 tiles for validation, and the remaining 17 tiles for testing our model. |
| Hardware Specification | Yes | For all experiments, we use a single GPU: A100 40GB (for ImageNet) or V100 16GB (for all other experiments). |
| Software Dependencies | No | The paper mentions software like PyTorch and optimizers like SGD, Adam, AdamW, but does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | For training-from-scratch on CIFAR-10 and CIFAR-100, training is done over 300 epochs with a cosine learning rate (1e-1 to 1e-4) (Loshchilov and Hutter, 2016) using SGD with a weight decay of 1e-4. Batch size is 64 and REPAIR uses 5 forward-passes. For training-from-scratch on ImageNet, training is done over 90 epochs with a cosine learning rate (1e-1 to 1e-4) (Loshchilov and Hutter, 2016) using SGD with a weight decay of 1e-4. Batch size is 256 and REPAIR is not used. |
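The core averaging step that Algorithm 1 describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`papa_pull`, `papa_all`) and the `rate` value are hypothetical, and real usage would operate on per-layer network parameters rather than flat NumPy vectors. The sketch only shows the two averaging modes the review references: a gradual pull of every member toward the population mean (PAPA) and a periodic hard replacement by the mean (PAPA-all).

```python
import numpy as np

def papa_pull(weights, rate=0.5):
    """Gradually pull each population member toward the population mean.

    weights: list of per-network parameter vectors (np.ndarray).
    rate: interpolation strength toward the mean (hypothetical value;
    the paper tunes its own pull rate).
    """
    mean = np.mean(weights, axis=0)
    return [(1.0 - rate) * w + rate * mean for w in weights]

def papa_all(weights):
    """PAPA-all variant: periodically replace every member with the mean."""
    mean = np.mean(weights, axis=0)
    return [mean.copy() for _ in weights]
```

For example, a two-member population with parameters `[0, 2]` and `[2, 0]` has mean `[1, 1]`; `papa_pull` with `rate=0.5` moves the first member to `[0.5, 1.5]`, while `papa_all` sets both members to `[1, 1]`.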
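The cosine learning-rate schedule quoted in the setup row (1e-1 decaying to 1e-4) follows the standard annealing formula of Loshchilov and Hutter (2016). A minimal sketch, assuming the schedule spans the full training run with no restarts (the function name and signature are illustrative, not from the paper):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=1e-1, lr_min=1e-4):
    """Cosine-annealed learning rate: lr_max at epoch 0,
    decaying smoothly to lr_min at the final epoch."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

With the CIFAR settings this gives `cosine_lr(0) == 0.1` and `cosine_lr(300) == 0.0001`; for ImageNet one would pass `total_epochs=90` with the same endpoints.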