Entropy-based Activation Function Optimization: A Method on Searching Better Activation Functions
Authors: Haoyuan Sun, Zihao Wu, Bo Xia, Pu Chang, Zibin Dong, Yifu Yuan, Yongzhe Chang, Xueqian Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted with the vision transformer and its variants on CIFAR-10, CIFAR-100 and ImageNet-1K datasets demonstrate the superiority of CRReLU over existing corrections of ReLU. In extensive empirical studies on the task of large language model (LLM) fine-tuning, CRReLU exhibits superior performance compared to GELU, suggesting its broader potential for practical applications. |
| Researcher Affiliation | Academia | Haoyuan Sun (1), Zihao Wu (2), Bo Xia (1), Pu Chang (3), Zibin Dong (2), Yifu Yuan (2), Yongzhe Chang (1), Xueqian Wang (1); (1) Tsinghua University, (2) Tianjin University, (3) Anhui Polytechnic University. Equal contribution: EMAIL, EMAIL. Corresponding authors: EMAIL, EMAIL |
| Pseudocode | Yes | Finally, we provide further details of CRReLU in Appendix D, including Python-like pseudocode of CRReLU in Appendix D.1, and further discussion on properties of CRReLU in Appendix D.2. Appendix D, Further Details of CRReLU; D.1, Correction Regularized ReLU (CRReLU) Pseudocode; Algorithm 1: Correction Regularized ReLU (CRReLU) Pseudocode |
| Open Source Code | No | The paper includes pseudocode in Appendix D.1 (Algorithm 1) and mentions 'python-like pseudocode', but it does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Datasets. In experiments of the image classification task, we adopt three datasets, ordered as CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-1K (Deng et al., 2009) in terms of the number of classification categories. In experiments of the large language model (LLM) fine-tuning task, we employ two human preference datasets: SHP (Ethayarajh et al., 2022) and HH (Bai et al., 2022). In Appendix F, we conduct experiments on ConvNeXt and EuroSAT (Helber et al., 2019) to verify generalization of CRReLU to network architecture and dataset. |
| Dataset Splits | Yes | Datasets. In experiments of the image classification task, we adopt three datasets, ordered as CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-1K (Deng et al., 2009) in terms of the number of classification categories. In experiments of the large language model (LLM) fine-tuning task, we employ two human preference datasets: SHP (Ethayarajh et al., 2022) and HH (Bai et al., 2022). |
| Hardware Specification | Yes | We conduct all experiments within CIFAR-10 and CIFAR-100 on 4 RTX3090 and those within ImageNet-1K on 4 NVIDIA L20 for 100 epochs using the AdamW optimizer with weight decay of 0.05, truncated normal initialization (we provide further discussion on the reason for abandoning the pre-trained initialization method in Appendix J.1), gradient clipping norm of 1.0, cross-entropy loss function, and cosine annealing learning rate scheduler with linear warm-up. All experiments are conducted for three runs; we report the mean and standard deviation. LLM fine-tuning ... on 2 RTX3090. F.1 ADDITIONAL ARCHITECTURE ... experiments within CIFAR-10 and CIFAR-100 are conducted on 4 RTX3090 and those within ImageNet-1K are conducted on 4 NVIDIA L20. F.2 ADDITIONAL DATASET ... on a single RTX3090 |
| Software Dependencies | No | The paper mentions 'python-like pseudocode' and imports for 'torch', 'torch.nn', 'torch.nn.functional' in Algorithm 1, as well as 'timm' for data augmentation. However, specific version numbers for these software components (e.g., Python, PyTorch, timm) are not provided in the text. |
| Experiment Setup | Yes | Experimental hyperparameters. For all transformer-based architectures, we directly set ε to 0.01 without further optimization. Detailed experimental hyperparameters are provided in Appendix N. We conduct all experiments within CIFAR-10 and CIFAR-100 on 4 RTX3090 and those within ImageNet-1K on 4 NVIDIA L20 for 100 epochs using the AdamW optimizer with weight decay of 0.05, truncated normal initialization... gradient clipping norm of 1.0, cross-entropy loss function, and cosine annealing learning rate scheduler with linear warm-up. Table 10: Experimental settings of ViT, DeiT and TNT on CIFAR-10 and CIFAR-100 datasets (Image Size, Patch Size, Embedding Dim, Optimizer, Learning Rate, Warm up, Gradient Clipping, Training Epochs, Batch Size, Loss Function, Normalization, Data Augmentation, Dropout and Drop Path). Table 13: Experimental settings of GPT2 fine-tuning task (Batch Size, Optimizer, Learning Rate, Trainer, Max Gradient Norm, Max Length for an Input, Max Length for Prompt). |
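The Pseudocode row above notes that Appendix D.1 gives torch-based pseudocode for CRReLU (Algorithm 1), which this review does not reproduce. A minimal sketch of how a correction-regularized activation module could be structured in PyTorch is shown below; the Gaussian-shaped correction term `x * exp(-x**2 / 2)` is an illustrative placeholder, not the paper's actual formula, while ε = 0.01 matches the hyperparameter quoted in the Experiment Setup row.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CorrectionRegularizedReLUSketch(nn.Module):
    """Illustrative sketch of a ReLU-plus-correction activation.

    The paper fixes the strength eps to 0.01 for all transformer
    architectures; the smooth correction term below is a placeholder
    assumption, not the CRReLU formula from Appendix D.
    """

    def __init__(self, eps: float = 0.01):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU base plus a small smooth correction scaled by eps.
        correction = x * torch.exp(-x.pow(2) / 2)  # placeholder term
        return F.relu(x) + self.eps * correction


act = CorrectionRegularizedReLUSketch()
out = act(torch.tensor([-1.0, 0.0, 2.0]))
```

For negative inputs the ReLU term vanishes and only the ε-scaled correction survives, which is the general shape a "correction of ReLU" takes; the exact correction function must be read off the paper's Algorithm 1.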
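The training recipe quoted in the Hardware Specification and Experiment Setup rows (AdamW with weight decay 0.05, gradient clipping at norm 1.0, cross-entropy loss, cosine annealing with linear warm-up over 100 epochs) can be sketched in PyTorch as follows; the model, learning rate, and warm-up length are placeholder assumptions, since the review only quotes the settings listed above.

```python
import torch
import torch.nn as nn

# Settings quoted in the review: AdamW, weight decay 0.05, gradient
# clipping norm 1.0, cross-entropy loss, cosine annealing LR with
# linear warm-up, 100 epochs. Model and LR are placeholders.
model = nn.Linear(32, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 100  # warm-up length is an assumption
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on random data.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```

`SequentialLR` chains the linear warm-up into the cosine decay at the warm-up boundary, which is the standard way to express "cosine annealing with linear warm-up" in current PyTorch.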