Entropy-based Activation Function Optimization: A Method on Searching Better Activation Functions
Authors: Haoyuan Sun, Zihao Wu, Bo Xia, Pu Chang, Zibin Dong, Yifu Yuan, Yongzhe Chang, Xueqian Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted with the vision transformer and its variants on CIFAR-10, CIFAR-100 and ImageNet-1K datasets demonstrate the superiority of CRReLU over existing corrections of ReLU. In extensive empirical studies on the task of large language model (LLM) fine-tuning, CRReLU exhibits superior performance compared to GELU, suggesting its broader potential for practical applications. |
| Researcher Affiliation | Academia | Haoyuan Sun (1), Zihao Wu (2), Bo Xia (1), Pu Chang (3), Zibin Dong (2), Yifu Yuan (2), Yongzhe Chang (1), Xueqian Wang (1); (1) Tsinghua University, (2) Tianjin University, (3) Anhui Polytechnic University. Equal contribution: EMAIL, EMAIL. Corresponding authors: EMAIL, EMAIL |
| Pseudocode | Yes | Finally, we provide further details of CRReLU in Appendix D, including Python-like pseudocode of CRReLU in Appendix D.1, and further discussion on properties of CRReLU in Appendix D.2. Appendix D, Further Details of CRReLU; D.1, Correction Regularized ReLU (CRReLU) Pseudocode; Algorithm 1: Correction Regularized ReLU (CRReLU) Pseudocode |
| Open Source Code | No | The paper includes pseudocode in Appendix D.1 (Algorithm 1) and mentions 'python-like pseudocode', but it does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Datasets. In experiments of the image classification task, we adopt three datasets, ordered as CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-1K (Deng et al., 2009) in terms of the number of classification categories. In experiments of the large language model (LLM) fine-tuning task, we employ two human preference datasets: SHP (Ethayarajh et al., 2022) and HH (Bai et al., 2022). In Appendix F, we conduct experiments on ConvNeXt and EuroSAT (Helber et al., 2019) to verify generalization of CRReLU to network architecture and dataset. |
| Dataset Splits | Yes | Datasets. In experiments of the image classification task, we adopt three datasets, ordered as CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-1K (Deng et al., 2009) in terms of the number of classification categories. In experiments of the large language model (LLM) fine-tuning task, we employ two human preference datasets: SHP (Ethayarajh et al., 2022) and HH (Bai et al., 2022). |
| Hardware Specification | Yes | We conduct all experiments within CIFAR-10 and CIFAR-100 on 4 RTX3090 and those within ImageNet-1K on 4 NVIDIA L20 for 100 epochs using the AdamW optimizer with weight decay of 0.05, truncated normal initialization (we provide further discussion on the reason for abandoning the pre-trained initialization method in Appendix J.1), gradient clipping norm of 1.0, cross-entropy loss function, and cosine annealing learning rate scheduler with linear warm-up. All experiments are conducted for three runs; we report the mean and standard deviation. LLM fine-tuning ... on 2 RTX3090. F.1 ADDITIONAL ARCHITECTURE ... experiments within CIFAR-10 and CIFAR-100 are conducted on 4 RTX3090 and those within ImageNet-1K are conducted on 4 NVIDIA L20. F.2 ADDITIONAL DATASET ... on a single RTX3090 |
| Software Dependencies | No | The paper mentions 'python-like pseudocode' and imports for 'torch', 'torch.nn', 'torch.nn.functional' in Algorithm 1, as well as 'timm' for data augmentation. However, specific version numbers for these software components (e.g., Python, PyTorch, timm) are not provided in the text. |
| Experiment Setup | Yes | Experimental hyperparameters. For all transformer-based architectures, we directly set ε to 0.01 without further optimization. Detailed experimental hyperparameters are provided in Appendix N. We conduct all experiments within CIFAR-10 and CIFAR-100 on 4 RTX3090 and those within ImageNet-1K on 4 NVIDIA L20 for 100 epochs using the AdamW optimizer with weight decay of 0.05, truncated normal initialization... gradient clipping norm of 1.0, cross-entropy loss function, and cosine annealing learning rate scheduler with linear warm-up. Table 10: Experimental settings of ViT, DeiT and TNT on CIFAR-10 and CIFAR-100 datasets (Image Size, Patch Size, Embedding Dim, Optimizer, Learning Rate, Warm up, Gradient Clipping, Training Epochs, Batch Size, Loss Function, Normalization, Data Augmentation, Dropout and Drop Path). Table 13: Experimental settings of GPT2 fine-tuning task (Batch Size, Optimizer, Learning Rate, Trainer, Max Gradient Norm, Max Length for an Input, Max Length for Prompt). |
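The Pseudocode row above notes that Appendix D.1 gives torch-based pseudocode for CRReLU (Algorithm 1), which this review does not reproduce. A minimal sketch of how a correction-regularized activation module could be structured in PyTorch is shown below; the Gaussian-shaped correction term `x * exp(-x**2 / 2)` is an illustrative placeholder, not the paper's actual formula, while ε = 0.01 matches the hyperparameter quoted in the Experiment Setup row.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CorrectionRegularizedReLUSketch(nn.Module):
    """Illustrative sketch of a ReLU-plus-correction activation.

    The paper fixes the strength eps to 0.01 for all transformer
    architectures; the smooth correction term below is a placeholder
    assumption, not the CRReLU formula from Appendix D.
    """

    def __init__(self, eps: float = 0.01):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU base plus a small smooth correction scaled by eps.
        correction = x * torch.exp(-x.pow(2) / 2)  # placeholder term
        return F.relu(x) + self.eps * correction


act = CorrectionRegularizedReLUSketch()
out = act(torch.tensor([-1.0, 0.0, 2.0]))
```

For negative inputs the ReLU term vanishes and only the ε-scaled correction survives, which is the general shape a "correction of ReLU" takes; the exact correction function must be read off the paper's Algorithm 1.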
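The training recipe quoted in the Hardware Specification and Experiment Setup rows (AdamW with weight decay 0.05, gradient clipping at norm 1.0, cross-entropy loss, cosine annealing with linear warm-up over 100 epochs) can be sketched in PyTorch as follows; the model, learning rate, and warm-up length are placeholder assumptions, since the review only quotes the settings listed above.

```python
import torch
import torch.nn as nn

# Settings quoted in the review: AdamW, weight decay 0.05, gradient
# clipping norm 1.0, cross-entropy loss, cosine annealing LR with
# linear warm-up, 100 epochs. Model and LR are placeholders.
model = nn.Linear(32, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 100  # warm-up length is an assumption
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on random data.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```

`SequentialLR` chains the linear warm-up into the cosine decay at the warm-up boundary, which is the standard way to express "cosine annealing with linear warm-up" in current PyTorch.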