TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan E Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.git. ... In this section, we present experimental results for the techniques described above. Section 4.1 validates the continual expansion capability of our model. Section 4.2 highlights the model's efficacy in handling tasks within both language and vision domains. Section 4.3 offers an in-depth comparison, highlighting our model's advantages over standard Transformer models. Finally, Section 4.4 details the ablation experiments conducted to assess the significance of each module in Tokenformer.
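The incremental scaling quoted above — growing a trained model by appending new key-value parameter pairs — can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: the names `pattention` and `grow` are hypothetical, and the GeLU-based weighting merely stands in for the paper's modified-softmax normalization. The point it illustrates is that zero-initialized new pairs contribute nothing, so the grown layer initially computes the same function as the old one.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; gelu(0) == 0 exactly
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pattention(x, k_p, v_p):
    """Token-parameter attention sketch: input tokens attend over
    learnable key-value parameter pairs (k_p, v_p)."""
    scores = x @ k_p.T          # (seq, n_params)
    return gelu(scores) @ v_p   # (seq, d_out)

def grow(k_p, v_p, n_new):
    """Incremental scaling sketch: append n_new zero-initialized
    key-value parameter pairs. Because their scores are zero and
    gelu(0) == 0, the layer's output is unchanged at first."""
    k_grown = np.vstack([k_p, np.zeros((n_new, k_p.shape[1]))])
    v_grown = np.vstack([v_p, np.zeros((n_new, v_p.shape[1]))])
    return k_grown, v_grown
```

Under this sketch, growing from 16 to 24 parameter pairs leaves the output identical until the new pairs receive gradient updates.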
Researcher Affiliation | Collaboration | Haiyang Wang (1,3), Yue Fan (1), Muhammad Ferjad Naeem (2), Yongqin Xian (2), Jan Eric Lenssen (1), Liwei Wang (3), Federico Tombari (2), Bernt Schiele (1). Affiliations: 1 Max Planck Institute for Informatics, SIC; 2 Google; 3 Peking University.
Pseudocode | No | The paper describes methodologies in Section 3, using mathematical formulations (e.g., Equations 1–12) to explain the Tokenformer architecture and progressive scaling. However, it does not include any explicit section or figure labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code blocks.
Open Source Code | Yes | Code and models are available at https://github.com/Haiyang-W/TokenFormer.git.
Open Datasets | Yes | Our models are trained using the OpenWebText Corpus described in (Gokaslan & Cohen, 2019). ... Training is performed on the Pile dataset (Gao et al., 2020)... We compare our approach against the standard Vision Transformer (ViT) (Dosovitskiy et al., 2021) trained with supervised learning on the ImageNet-1K dataset (Deng et al., 2009). ... In this experiment, we utilize the EnWik8 (Mahoney, 2011) dataset...
Dataset Splits | Yes | The dataset comprises 8,013,769 documents, from which we randomly select 5% to serve as the validation set and report perplexity on this subset. During training, we randomly sample segments from these documents.
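The quoted split procedure — randomly holding out 5% of the documents for validation — amounts to a seeded shuffle and slice. A minimal sketch (the helper name `split_docs` is hypothetical, not from the paper):

```python
import random

def split_docs(n_docs, val_frac=0.05, seed=0):
    """Randomly hold out val_frac of document indices as a validation
    set, as in the quoted setup (5% of 8,013,769 documents)."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)   # seeded for reproducibility
    n_val = int(n_docs * val_frac)
    return idx[n_val:], idx[:n_val]    # (train_indices, val_indices)
```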
Hardware Specification | Yes | All the experiments were conducted on TPU v4 hardware.
Software Dependencies | No | The paper mentions using the MMDetection codebase for visual tasks and the AdamW optimizer, but does not specify version numbers for any software libraries, programming languages, or environments used in the experiments.
Experiment Setup | Yes | Following the training procedures outlined in Karpathy (2022); Kaplan et al. (2020), we employed the AdamW optimizer (Loshchilov & Hutter, 2019) with a batch size of 512 sequences, each containing 1024 tokens. For a fair comparison with our incremental scaling approach, we configured two training variants based on the total number of training tokens. The first variant underwent 6 × 10^5 steps (approximately 300B tokens)... A learning rate of 6 × 10^−4 was employed, featuring a 2000-step warmup followed by a cosine decay to zero.
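The quoted schedule — linear warmup over 2000 steps to a peak of 6 × 10^−4, then cosine decay to zero over 6 × 10^5 total steps — can be written down directly. A minimal sketch (the function name `lr_at` is hypothetical; note that 512 sequences × 1024 tokens × 6 × 10^5 steps ≈ 3.1 × 10^11 tokens, matching the quoted "approximately 300B"):

```python
import math

def lr_at(step, peak_lr=6e-4, warmup=2000, total=6 * 10**5):
    """Learning rate at a given step: linear warmup to peak_lr over
    `warmup` steps, then cosine decay to zero at `total` steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 at peak, 1 at end
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at 6 × 10^−4 at step 2000, passes through half the peak midway through the decay, and reaches 0 at step 6 × 10^5.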