Continual HyperTransformer: A Meta-Learner for Continual Few-Shot Learning

Authors: Max Vladymyrov, Andrey Zhmoginov, Mark Sandler

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios. ... Most of our experiments were conducted using two standard benchmark problems using Omniglot and tieredImageNet datasets."
Researcher Affiliation | Industry | Max Vladymyrov (Google Research), Andrey Zhmoginov (Google Research), Mark Sandler (Google Research)
Pseudocode | Yes | "Algorithm 1: Class-incremental learning using HyperTransformer with Prototypical Loss. Input: T randomly sampled K-way N-shot episodes {S(t), Q(t)}, t = 0, ..., T. Output: the loss value J for the generated set of tasks."
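The quoted Algorithm 1 accumulates a prototypical loss over a sequence of episodes, but the paper's implementation is not available. As a rough point of reference, below is a minimal NumPy sketch of a standard prototypical loss in the style of Snell et al. (2017); the function name, array shapes, and the use of squared Euclidean distance as the logit are our assumptions, not the authors' code:

```python
import numpy as np

def prototypical_loss(support_emb, support_y, query_emb, query_y, num_classes):
    """Cross-entropy of query embeddings against class prototypes.

    support_emb, query_emb: float arrays of shape (n_samples, emb_dim).
    support_y, query_y: int class labels in [0, num_classes).
    """
    # Prototype for each class: mean of that class's support embeddings.
    protos = np.stack([support_emb[support_y == k].mean(axis=0)
                       for k in range(num_classes)])
    # Negative squared Euclidean distance to each prototype acts as the logit.
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -dists
    # Numerically stable log-softmax, then average negative log-likelihood.
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(query_y)), query_y].mean()
```

In the class-incremental setting described by the paper, a loss of this form would be evaluated after each task t using prototypes carried over from all tasks seen so far.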
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code, nor does it include a link to a code repository.
Open Datasets | Yes | "Most of our experiments were conducted using two standard benchmark problems using Omniglot and tieredImageNet datasets. ... We verify this by creating a multi-domain episode generator that includes tasks from various image datasets: Omniglot, Caltech101, CaltechBirds2011, Cars196, OxfordFlowers102 and StanfordDogs."
Dataset Splits | No | "The reported accuracy was calculated from 1024 random episodic evaluations from a separate test distribution, with each episode run 16 times with different combinations of input samples. ... We compare the performance of CHT to two baseline models. The first is a Constant ProtoNet (ConstPN), which represents a vanilla Prototypical Network, as described in Snell et al. (2017). In this approach, a universal fixed CNN network is trained on episodes from C_train. ... Finally, we can test the performance of the trained model a_ψ on episodes sampled from a holdout set of classes C_test."
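The evaluation protocol above repeatedly samples K-way N-shot episodes from a held-out class split C_test. The paper's episode generator is not released; a generic sampler consistent with that protocol might look like the following sketch (the dict-based dataset layout and function name are illustrative assumptions):

```python
import random

def sample_episode(dataset, k_way, n_shot, n_query, rng=random):
    """Sample one K-way N-shot episode.

    dataset: dict mapping class label -> list of examples.
    Returns (support, query) lists of (example, episode_label) pairs,
    where labels are re-indexed to 0..k_way-1 within the episode.
    """
    classes = rng.sample(sorted(dataset), k_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        picks = rng.sample(dataset[c], n_shot + n_query)
        support += [(x, episode_label) for x in picks[:n_shot]]
        query += [(x, episode_label) for x in picks[n_shot:]]
    return support, query
```

Reported accuracy would then be the mean over many such episodes (1024 in the paper, each repeated 16 times with different sample combinations).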
Hardware Specification | No | "In all our experiments, we trained the network on a single GPU for 4M steps with SGD with an exponential LR decay over 100,000 steps with a decay rate of 0.97." (The GPU model is not specified.)
Software Dependencies | No | The paper mentions 'SGD' as an optimizer and 'Transformer' and 'CNN' architectures, but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions).
Experiment Setup | Yes | "In all our experiments, we trained the network on a single GPU for 4M steps with SGD with an exponential LR decay over 100,000 steps with a decay rate of 0.97. We noticed some stability issues when increasing the number of tasks and had to decrease the learning rate to compensate: for Omniglot experiments, we used a learning rate of 10^-4 for up to 4 tasks and 5x10^-5 for 5 tasks. For tieredImageNet, we used the same learning rate of 5x10^-6 for training with any number of tasks T. ... The generated weights for each task θ_t are composed of four convolutional blocks and a single dense layer. Each convolutional block consists of a 3x3 convolutional layer, a batch-norm layer, a ReLU activation and a 2x2 max-pooling layer. For Omniglot we used 8 filters for the convolutional layers and a 20-dim FC layer to demonstrate how the network works on small problems, and for tieredImageNet we used 64 filters for the convolutional layers and a 40-dim FC layer."
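The quoted schedule ("exponential LR decay over 100,000 steps with a decay rate of 0.97") matches the standard exponential-decay rule; a minimal sketch, assuming continuous rather than staircase decay (the paper does not say which):

```python
def exponential_lr(step, base_lr=1e-4, decay_rate=0.97, decay_steps=100_000):
    """Learning rate after `step` SGD steps under exponential decay.

    lr(step) = base_lr * decay_rate ** (step / decay_steps)
    base_lr defaults to the Omniglot value quoted above.
    """
    return base_lr * decay_rate ** (step / decay_steps)
```

Under this rule, the full 4M-step run decays the rate by a factor of 0.97**40 ≈ 0.30, i.e. the Omniglot learning rate ends near 3x10^-5.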