Kolmogorov-Arnold Transformer
Authors: Xingyi Yang, Xinchao Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantages of KAT across various tasks, including image recognition, segmentation, detection, table classification, and graph classification. It consistently enhances performance over the standard transformer architectures of different model sizes. ... We empirically validate KAT across a range of vision tasks, including image recognition, object detection, and semantic segmentation. ... The results demonstrate that KAT outperforms traditional MLP-based transformers, with similar computational requirements. As shown in Figure 1, KAT-B achieves 82.3% accuracy on ImageNet-1K, surpassing the ViT-B by 3.1%. ... The experimental results demonstrate that the KAT models consistently outperform their counterparts on the IN-1k dataset, as shown in Table 5. |
| Researcher Affiliation | Academia | Xingyi Yang, Xinchao Wang National University of Singapore EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions. There are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are there structured code-like procedures presented in an algorithm format. |
| Open Source Code | No | The paper mentions a GitHub link in reference (Chen et al., 2024b): 'Ziwen Chen, Gundavarapu, and WU DI. Vision-kan: Exploring the possibility of kan replacing mlp in vision transformer. https://github.com/chenziwenhaoshuai/Vision-KAN.git, 2024b.' However, this refers to prior work by other authors ('ViT + KAN') and not the authors' own code for the Kolmogorov Arnold Transformer (KAT) methodology described in this paper. There is no explicit statement from the authors of this paper that they are releasing their code, nor is a direct link to their own code repository provided. |
| Open Datasets | Yes | We do experiments on ImageNet-1K [59] image classification benchmark. ... We evaluate our approach on the MS-COCO2017 (Lin et al., 2014) dataset, a standard benchmark for object detection and instance segmentation. ... We evaluated our KAT model on the ADE20K dataset (Zhou et al., 2017). ... We test 15 publicly available binary classification datasets from UCI dataset (Bache & Lichman, 2013), AutoML challenge (Guyon et al., 2019) and Kaggle. ... We replicate its experiments on ZINC (Irwin et al., 2012) for graph regression and PATTERN (Abbe, 2017) and CLUSTER (Dwivedi et al., 2023) for node classification. |
| Dataset Splits | Yes | We do experiments on ImageNet-1K [59] image classification benchmark. ImageNet-1K is one of the most widely-used datasets in computer vision which contains about 1.3M images of 1K classes on training set, and 50K images on validation set. ... We followed the standard 3× training schedule, which consists of 36 epochs. The training images were resized to 800 x 1333 pixels. ... This dataset comprises 150 semantic categories with 20,000 images in the training set and 2,000 in the validation set. |
| Hardware Specification | Yes | The experiments were carried out on 4 NVIDIA H100 GPUs. ... Our implementation was carried out using the PyTorch and mmsegmentation libraries, and the experiments were performed on two NVIDIA H100 GPUs. ... In addition to accuracy, we analyzed the computational cost of different activations by measuring throughput and peak memory usage on an NVIDIA A5000 GPU (Table 7). |
| Software Dependencies | No | Our implementation was based on the PyTorch and MMDetection (Chen et al., 2019) libraries... Our implementation was carried out using the PyTorch and mmsegmentation libraries... Rather than using PyTorch with automatic differentiation, we implement it fully with CUDA (Nickolls et al., 2008). The paper mentions software such as PyTorch, MMDetection, mmsegmentation, and CUDA, but it does not specify any version numbers for these components. Therefore, it does not provide a reproducible description of ancillary software with specific version numbers. |
| Experiment Setup | Yes | We mainly follow the hyper-parameters of DeiT (Touvron et al., 2021). Specifically, models are trained for 300 epochs at 224x224 resolution. The patch size is set to 16. Data augmentation and regularization techniques include RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), Random Erasing (Zhong et al., 2020), weight decay, Label Smoothing (Szegedy et al., 2016) and Stochastic Depth (Huang et al., 2016). We adopt AdamW (Loshchilov & Hutter, 2019) optimizer with batch size of 1024. ... The AdamW optimizer (Loshchilov & Hutter, 2019) was used with a learning rate of 0.0001 and a total batch size of 16. ... The hyper-parameter for training KAT model on ImageNet-1K is shown in Table 14. |
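The quoted ImageNet-1K recipe (DeiT-style: 300 epochs, 224x224, patch 16, AdamW, batch 1024, plus the listed augmentations) can be gathered into a single config object for reproduction attempts. A minimal sketch, assuming only the values quoted above; the dict keys and the `summarize` helper are illustrative, not from the paper:

```python
# Hedged sketch of the reported KAT ImageNet-1K training recipe.
# Only hyper-parameters quoted in the assessment above are included;
# key names and the helper function are hypothetical conveniences.
kat_imagenet_config = {
    "epochs": 300,
    "resolution": (224, 224),
    "patch_size": 16,
    "optimizer": "AdamW",
    "batch_size": 1024,
    "augmentations": ["RandAugment", "Mixup", "CutMix", "RandomErasing"],
    "regularization": ["weight_decay", "label_smoothing", "stochastic_depth"],
}

def summarize(cfg: dict) -> str:
    """Render a one-line summary of the training recipe."""
    h, w = cfg["resolution"]
    return (f"{cfg['epochs']} epochs @ {h}x{w}, patch {cfg['patch_size']}, "
            f"{cfg['optimizer']} bs={cfg['batch_size']}")

print(summarize(kat_imagenet_config))
```

Note that the paper reports a second, separate setup (AdamW with learning rate 0.0001 and total batch size 16) for the downstream detection/segmentation experiments, which this sketch does not cover.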