TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan E Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.git. ... In this section, we present experimental results for the techniques described above. Section 4.1 validates the continual expansion capability of our model. Section 4.2 highlights the model's efficacy in handling tasks within both language and vision domains. Section 4.3 offers an in-depth comparison, highlighting our model's advantages over standard Transformer models. Finally, Section 4.4 details the ablation experiments conducted to assess the significance of each module in Tokenformer.
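The incremental scaling quoted above — growing a trained model by appending new key-value parameter pairs — can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: the names `pattention` and `grow` are hypothetical, and the GeLU-based weighting merely stands in for the paper's modified-softmax normalization. The point it illustrates is that zero-initialized new pairs contribute nothing, so the grown layer initially computes the same function as the old one.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; gelu(0) == 0 exactly
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pattention(x, k_p, v_p):
    """Token-parameter attention sketch: input tokens attend over
    learnable key-value parameter pairs (k_p, v_p)."""
    scores = x @ k_p.T          # (seq, n_params)
    return gelu(scores) @ v_p   # (seq, d_out)

def grow(k_p, v_p, n_new):
    """Incremental scaling sketch: append n_new zero-initialized
    key-value parameter pairs. Because their scores are zero and
    gelu(0) == 0, the layer's output is unchanged at first."""
    k_grown = np.vstack([k_p, np.zeros((n_new, k_p.shape[1]))])
    v_grown = np.vstack([v_p, np.zeros((n_new, v_p.shape[1]))])
    return k_grown, v_grown
```

Under this sketch, growing from 16 to 24 parameter pairs leaves the output identical until the new pairs receive gradient updates.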
Researcher Affiliation | Collaboration | Haiyang Wang (1,3), Yue Fan (1), Muhammad Ferjad Naeem (2), Yongqin Xian (2), Jan Eric Lenssen (1), Liwei Wang (3), Federico Tombari (2), Bernt Schiele (1). Affiliations: 1 Max Planck Institute for Informatics, SIC; 2 Google; 3 Peking University.
Pseudocode | No | The paper describes methodologies in Section 3, using mathematical formulations (e.g., Equations 1–12) to explain the Tokenformer architecture and progressive scaling. However, it does not include any explicit section or figure labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code blocks.
Open Source Code | Yes | Code and models are available at https://github.com/Haiyang-W/TokenFormer.git.
Open Datasets | Yes | Our models are trained using the OpenWebText Corpus described in (Gokaslan & Cohen, 2019). ... Training is performed on the Pile dataset (Gao et al., 2020)... We compare our approach against the standard Vision Transformer (ViT) (Dosovitskiy et al., 2021) trained with supervised learning on the ImageNet-1K dataset (Deng et al., 2009). ... In this experiment, we utilize the EnWik8 (Mahoney, 2011) dataset...
Dataset Splits | Yes | The dataset comprises 8,013,769 documents, from which we randomly select 5% to serve as the validation set and report perplexity on this subset. During training, we randomly sample segments from these documents.
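The quoted split procedure — randomly holding out 5% of the documents for validation — amounts to a seeded shuffle and slice. A minimal sketch (the helper name `split_docs` is hypothetical, not from the paper):

```python
import random

def split_docs(n_docs, val_frac=0.05, seed=0):
    """Randomly hold out val_frac of document indices as a validation
    set, as in the quoted setup (5% of 8,013,769 documents)."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)   # seeded for reproducibility
    n_val = int(n_docs * val_frac)
    return idx[n_val:], idx[:n_val]    # (train_indices, val_indices)
```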
Hardware Specification | Yes | All the experiments were conducted on TPU v4 hardware.
Software Dependencies | No | The paper mentions using the MMDetection codebase for visual tasks and the AdamW optimizer, but does not specify version numbers for any software libraries, programming languages, or environments used in the experiments.
Experiment Setup | Yes | Following the training procedures outlined in Karpathy (2022); Kaplan et al. (2020), we employed the AdamW optimizer (Loshchilov & Hutter, 2019) with a batch size of 512 sequences, each containing 1024 tokens. For a fair comparison with our incremental scaling approach, we configured two training variants based on the total number of training tokens. The first variant underwent 6 × 10^5 steps (approximately 300B tokens)... A learning rate of 6 × 10^−4 was employed, featuring a 2000-step warmup followed by a cosine decay to zero.
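The quoted schedule — linear warmup over 2000 steps to a peak of 6 × 10^−4, then cosine decay to zero over 6 × 10^5 total steps — can be written down directly. A minimal sketch (the function name `lr_at` is hypothetical; note that 512 sequences × 1024 tokens × 6 × 10^5 steps ≈ 3.1 × 10^11 tokens, matching the quoted "approximately 300B"):

```python
import math

def lr_at(step, peak_lr=6e-4, warmup=2000, total=6 * 10**5):
    """Learning rate at a given step: linear warmup to peak_lr over
    `warmup` steps, then cosine decay to zero at `total` steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 at peak, 1 at end
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at 6 × 10^−4 at step 2000, passes through half the peak midway through the decay, and reaches 0 at step 6 × 10^5.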