Frozen Language Models Are Gradient Coherence Rectifiers in Vision Transformers
Authors: Lichen Bai, Zixuan Xiong, Hai Lin, Guangwei Xu, Xiangjin Xie, Ruijie Guo, Zhanhui Kang, Hai-Tao Zheng, Hong-Gee Kim
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the effectiveness of this strategy, making the practical application of the gradient rectification effect feasible. ... Our experiments verify that the frozen LLM block has a certain gradient coherence rectification effect. ... Experiments Datasets ImageNet-1K (Russakovsky et al. 2015), also known as ILSVRC 2012 ... Performance Evaluation After incorporating auxiliary training, we compare its performance with that of the vanilla ViT on the ImageNet and SSv2 datasets in terms of accuracy. ... Ablation Studies In Tab. 6, we focus on the impact of the weight of auxiliary training on overall performance. |
| Researcher Affiliation | Collaboration | 1Shenzhen International Graduate School, Tsinghua University 2Pengcheng Laboratory 3 Alibaba Cloud Computing 4 Machine Learning Platform Department, Tencent 5 Seoul National University |
| Pseudocode | No | The paper includes figures illustrating concepts and a framework diagram (Figure 6), but no explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide links to any code repositories or mention code in supplementary materials. |
| Open Datasets | Yes | Datasets ImageNet-1K (Russakovsky et al. 2015) ... Something-Something-v2 (Goyal et al. 2017) ... CIFAR-100 (Krizhevsky et al. 2009) ... Caltech-256 (Griffin, Holub, and Perona 2007) |
| Dataset Splits | Yes | To analyze gradient changes, we choose the DeiT architecture (Touvron et al. 2021) and train it on CIFAR-100 (Krizhevsky et al. 2009) for 300 epochs. ... We split the dataset into training and testing sets in a 7:3 ratio. |
| Hardware Specification | Yes | updates on A800 and RTX 3090 devices. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and cosine annealing scheduling, but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Implementation Details For the video understanding task, we use VideoMAE (Tong et al. 2022) and train it on the SSv2 dataset. We train for 40 epochs with a batch size of 24 for ViT-S, and 30 epochs with a batch size of 12 for ViT-B. ... For the image classification task, we utilize DeiT (Touvron et al. 2021) and train for 300 epochs. For ImageNet, we set the batch size to 1024. For CIFAR-100 and Caltech-256, we set the batch size to 256. And for Bar, we set the batch size to 64. We use the AdamW optimizer with a learning rate of 5e-4, weight decay of 1e-5, and cosine annealing scheduling for updates on A800 and RTX 3090 devices. |
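The experiment setup above pairs AdamW (learning rate 5e-4) with cosine annealing over 300 epochs. As a minimal sketch of what that schedule implies, the standard cosine annealing formula can be written in plain Python; the `min_lr=0.0` floor is an assumption, since the paper does not state a minimum learning rate.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, base_lr=5e-4, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over total_epochs.

    Standard schedule: lr(t) = min_lr + 0.5*(base_lr - min_lr)*(1 + cos(pi*t/T)).
    min_lr=0.0 is an assumed floor; the paper only specifies base_lr=5e-4.
    """
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))

# Starts at the full rate, halves at the midpoint, decays to min_lr at the end.
print(cosine_annealing_lr(0))    # 5e-4
print(cosine_annealing_lr(150))  # 2.5e-4
print(cosine_annealing_lr(300))  # 0.0
```

In frameworks such as PyTorch, the equivalent behavior is provided by `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the total epoch count.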