Continual HyperTransformer: A Meta-Learner for Continual Few-Shot Learning
Authors: Max Vladymyrov, Andrey Zhmoginov, Mark Sandler
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios. ... Most of our experiments were conducted using two standard benchmark problems using Omniglot and tieredImageNet datasets. |
| Researcher Affiliation | Industry | Max Vladymyrov (Google Research), Andrey Zhmoginov (Google Research), Mark Sandler (Google Research) |
| Pseudocode | Yes | Algorithm 1 Class-incremental learning using HyperTransformer with Prototypical Loss. Input: T randomly sampled K-way N-shot episodes {S(t), Q(t)} for t = 0, ..., T. Output: The loss value J for the generated set of tasks. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code, nor does it include a link to a code repository. |
| Open Datasets | Yes | Most of our experiments were conducted using two standard benchmark problems using Omniglot and tieredImageNet datasets. ... We verify this by creating a multi-domain episode generator that includes tasks from various image datasets: Omniglot, Caltech101, CaltechBirds2011, Cars196, OxfordFlowers102 and StanfordDogs. |
| Dataset Splits | No | The reported accuracy was calculated from 1024 random episodic evaluations from a separate test distribution, with each episode run 16 times with different combinations of input samples. ... We compare the performance of CHT to two baseline models. The first is a Constant ProtoNet (ConstPN), which represents a vanilla Prototypical Network, as described in Snell et al. (2017). In this approach, a universal fixed CNN network is trained on episodes from Ctrain. ... Finally, we can test the performance of the trained model aψ on episodes sampled from a holdout set of classes Ctest. |
| Hardware Specification | No | In all our experiments, we trained the network on a single GPU for 4M steps with SGD with an exponential LR decay over 100 000 steps with a decay rate of 0.97. |
| Software Dependencies | No | The paper mentions using 'SGD' as an optimizer and 'Transformer' and 'CNN' architectures but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | In all our experiments, we trained the network on a single GPU for 4M steps with SGD with an exponential LR decay over 100,000 steps with a decay rate of 0.97. We noticed some stability issues when increasing the number of tasks and had to decrease the learning rate to compensate: for Omniglot experiments, we used a learning rate 10^-4 for up to 4 tasks and 5x10^-5 for 5 tasks. For tieredImageNet, we used the same learning rate of 5x10^-6 for training with any number of tasks T. ... The generated weights for each task θt are composed of four convolutional blocks and a single dense layer. Each of the convolutional blocks consists of a 3x3 convolutional layer, batch norm layer, ReLU activation and a 2x2 max-pooling layer. For Omniglot we used 8 filters for convolutional layers and a 20-dim FC layer to demonstrate how the network works on small problems, and for tieredImageNet we used 64 filters for convolutional layers and a 40-dim FC layer. |
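The prototypical loss referenced in the Pseudocode and Dataset Splits rows follows Snell et al. (2017): class prototypes are per-dimension means of support-set embeddings, and a query is scored by a softmax over negative squared distances to each prototype. A minimal, generic sketch of that rule (illustrative values and names; not the paper's exact CHT formulation, which applies this loss to weights generated by the HyperTransformer):

```python
import math

def prototypes(support):
    """support maps class label -> list of embedding vectors.
    Returns class label -> prototype (per-dimension mean of support embeddings)."""
    return {c: [sum(xs) / len(vecs) for xs in zip(*vecs)] for c, vecs in support.items()}

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proto_loss(protos, query, label):
    """Cross-entropy of the query under a softmax over negative squared
    distances to each class prototype (the standard prototypical loss)."""
    logits = {c: -sq_dist(p, query) for c, p in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[label]

# Toy 2-way 2-shot episode with 2-dim embeddings (illustrative values).
support = {0: [[0.0, 0.0], [0.2, 0.0]], 1: [[1.0, 1.0], [0.8, 1.0]]}
protos = prototypes(support)
query = [0.1, 0.1]
pred = min(protos, key=lambda c: sq_dist(protos[c], query))
print(pred)  # query is nearest to the class-0 prototype
```

Because prototypes are simple running means, new classes can be appended without revisiting old support data, which is what makes this loss a natural fit for the class-incremental setting described in the paper.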
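The training schedule in the Experiment Setup row (exponential LR decay over 100,000 steps with rate 0.97, run for 4M steps) can be sketched as follows. This assumes continuous rather than staircase decay; the function name is illustrative, and the base rate of 10^-4 is the Omniglot setting for up to 4 tasks:

```python
def exp_decay_lr(step, base_lr=1e-4, decay_steps=100_000, decay_rate=0.97):
    """Continuous exponential decay: lr = base_lr * decay_rate ** (step / decay_steps).
    Assumption: continuous (non-staircase) decay; the paper does not say which."""
    return base_lr * decay_rate ** (step / decay_steps)

print(exp_decay_lr(0))          # base_lr at the start of training
print(exp_decay_lr(100_000))    # one decay period: base_lr * 0.97
print(exp_decay_lr(4_000_000))  # after 4M steps: base_lr * 0.97**40
```

Over the full 4M-step run this decays the learning rate by a factor of 0.97^40, roughly 0.30, i.e. a gentle schedule rather than an aggressive anneal.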