Masked Capsule Autoencoders

Authors: Miles Everett, Mingjun Zhong, Georgios Leontidis

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Across several experiments and ablation studies we demonstrate that, similarly to CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further advancements in this neural network domain. For instance, by pretraining on the Imagenette dataset consisting of 10 classes of ImageNet-sized images we achieve state-of-the-art results for Capsule Networks, demonstrating a 9% improvement compared to our baseline model.
Researcher Affiliation | Academia | Miles Everett, EMAIL, Department of Computing Science, University of Aberdeen, UK
Pseudocode | No | The paper includes mathematical equations for the self-routing mechanism (Equations 1, 2, 3) and loss functions (Equations 4, 5), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using and citing third-party libraries such as the PyTorch Image Models library (TIMM) and FVCore for calculations, but it does not provide an explicit statement or link to the source code for the proposed Masked Capsule Autoencoders (MCAE) methodology.
Open Datasets | Yes | Initially, we provide a sanity check on the MNIST dataset (LeCun et al., 2010), to provide quick experimentation to ensure that our methods work at all. Next, we use both the Fashion MNIST and CIFAR-10 datasets (Xiao et al., 2017; Krizhevsky et al., 2009)... The Small NORB dataset (LeCun et al., 2004)... Finally, we use the Imagenette and Imagewoof datasets (Howard, 2019a;b) to test our network's performance on larger, more realistic datasets.
Dataset Splits | Yes | When a validation dataset has not been predefined, we randomly split 10% of the training dataset to act as our validation dataset. The best model is tested once on the test set of our datasets, with the best model being chosen based on the epoch with the lowest validation loss... The augmentations that we use for this dataset are that we standardise and take random 32x32 crops during training. At test time, we centre crop the images to 32x32 as defined in (Ribeiro et al., 2020)... 1) Training only on azimuths in (300, 320, 340, 0, 20, 40) and testing on azimuths in the range of 60 to 280. 2) Training on the elevations in (30, 35, 40) degrees from horizontal and then testing on elevations in the range of 45 to 70 degrees.
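The 10% held-out validation split described above can be sketched as follows. This is an illustrative reconstruction only: the paper does not release code, so the function name, seed handling, and index-based structure are assumptions, not the authors' implementation.

```python
import random

def train_val_split(indices, val_frac=0.1, seed=0):
    """Randomly hold out a fraction of the training set as validation.

    Hypothetical sketch of the paper's stated 10% random split; the
    actual implementation details are not published.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    # First n_val shuffled indices become validation, the rest training.
    return shuffled[n_val:], shuffled[:n_val]

train_idx, val_idx = train_val_split(range(50000))
```

Model selection then uses only `val_idx` (lowest validation loss per epoch), with the test set touched once at the end, as the excerpt describes.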
Hardware Specification | No | The paper states: "We would like to thank the University of Aberdeen's High Performance Computing facility for enabling this work and the anonymous reviewers for their constructive feedback." This is a general mention of a computing facility but lacks specific hardware details such as GPU models, CPU models, or memory.
Software Dependencies | No | The paper mentions using "the SGD optimizer" and "the cosine annealing learning rate scheduler" but does not specify their versions. It also refers to "Pytorch Image Library (TIMM) Wightman (2019)" and "FVCore library FAIR (2023)" with citations, but does not provide specific version numbers for these libraries or for the underlying machine learning framework (e.g., the PyTorch version).
Experiment Setup | Yes | All of our experiments follow the same experimental setup. This involves optionally pretraining the network, minus the class capsules, for 50 epochs, with 50% of patches removed and either the removed patches or the whole image as the reconstruction target. We then add the class capsules to our network and fully finetune it for 350 epochs, following the supervised training settings of (Hahn et al., 2019; Everett et al., 2023)... All models use the SGD optimizer with default settings and the cosine annealing learning rate scheduler with a 0.1 initial learning rate.
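Two components of this setup can be made concrete with a small sketch: sampling the 50% patch mask used during pretraining, and the cosine annealing learning-rate schedule with a 0.1 initial rate. Both functions below are hypothetical reconstructions from the stated hyperparameters (the paper relies on library implementations and publishes no code); the cosine formula shown is the standard Loshchilov & Hutter schedule, which is an assumption about which variant was used.

```python
import math
import random

def sample_patch_mask(num_patches, mask_ratio=0.5, seed=0):
    """Randomly select which patches to remove before pretraining.

    Sketch of the paper's "50% of patches removed" step; the sampling
    strategy (uniform, per-image) is assumed, not confirmed.
    """
    rng = random.Random(seed)
    n_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), n_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)

def cosine_annealing_lr(epoch, total_epochs, lr_init=0.1, lr_min=0.0):
    """Cosine annealing from lr_init down to lr_min over total_epochs."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

For example, with a 14x14 patch grid (196 patches) the mask keeps 98 patches visible, and over the 350 finetuning epochs the learning rate decays from 0.1 at epoch 0 toward 0 at epoch 350.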