Equivariant Neural Functional Networks for Transformers
Authors: Viet-Hoang Tran, Thieu Vo, An Nguyen, Tho-Huu Tran, Minh-Khoi Nguyen-Nhat, Thanh Tran, Duy-Tung Pham, Tan Nguyen
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we release a dataset of over 125,000 Transformers model checkpoints trained on two datasets with two tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance. We empirically demonstrate that Transformer-NFN consistently outperforms other baseline models on our constructed datasets. Through comprehensive ablation studies, we emphasize Transformer-NFN's ability to effectively capture information within the transformer block, establishing it as a robust predictor of model generalization. |
| Researcher Affiliation | Collaboration | 1National University of Singapore; 2FPT Software AI Center, Vietnam; 3VinUniversity, Vietnam |
| Pseudocode | Yes | Appendix I: Implementation of Equivariant and Invariant Layers. I.1 Summary of Equivariant and Invariant Layers (I.1.1 Equivariant Layers with bullet notation; I.1.2 Invariant Layers with bullet notation). I.2 Equivariant Layers Pseudocode: I.2.1 [E(W)]^(Q:i)_{j,k}; I.2.2 [E(W)]^(K:i)_{j,k}; I.2.3 [E(W)]^(V:i)_{j,k}; I.2.4 [E(W)]^(O:i)_{j,k}; I.2.5 [E(W)]^(A)_{j,k}; I.2.6 [E(b)]^(A)_k; I.2.7 [E(W)]^(B)_{j,k}; I.2.8 [E(b)]^(B)_k. I.3 Invariant Layers Pseudocode. |
| Open Source Code | Yes | The code is publicly available at https://github.com/MathematicalAI-NUS/Transformer-NFN. Reproducibility Statement: Source codes for our experiments are provided in the supplementary materials of the paper. |
| Open Datasets | Yes | Additionally, we release a dataset of over 125,000 Transformers model checkpoints trained on two datasets with two tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance. 4. We release the Small Transformer Zoo dataset, which consists of more than 125,000 Transformers model checkpoints trained on two different tasks: digit image classification on MNIST and text topic classification on AGNews. To our knowledge, this is the first dataset of its kind. Reproducibility Statement: ... All datasets used in this paper are publicly available through an anonymous link provided in the README file of the supplementary material. |
| Dataset Splits | No | The paper uses the Small Transformer Zoo dataset, which consists of model checkpoints. For the experiments, it states: "we evaluate each model's prediction performance not only on the entire dataset but also on four smaller subsets, each filtered by accuracy thresholds of 20%, 40%, 60%, and 80%." While this describes evaluation subsets, it does not explicitly provide the training/test/validation splits used for the Transformer-NFN itself, nor the splits for the underlying MNIST and AGNews datasets used to train the transformer models in the zoo. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions computational aspects in a general sense, for example, for efficiency: "enabling efficient and highly parallelizable computations on modern GPUs.". |
| Software Dependencies | No | The paper mentions several software components and libraries, such as "Adam optimizer", "XGBoost (Chen & Guestrin, 2016)", "Light GBM (Ke et al., 2017)", "Random Forest (Breiman, 2001)", and concepts like "einsum". However, it does not provide specific version numbers for any of these components, which are necessary for reproducible software dependency information. |
| Experiment Setup | Yes | To create a wide range of transformer models, we opt to vary six hyperparameters in our experiments: train fraction, optimizer (SGD, SGDm, Adam, or RMSprop), learning rate, L2 regularization coefficient, weight initialization standard deviation, and dropout probability. ... Table 4 provides a detailed overview of our hyperparameter configurations. Overall, there are 8,000 configurations for each category, resulting in 16,000 configurations in total. These configurations are consistently applied across both tasks to ensure comparability. All models are trained for 100 epochs, with checkpoints and accuracy measurements recorded at epochs 50, 75, 100, and at the epoch with the best accuracy. Training details: The models were trained for a total of 50 epochs, using a batch size of 16. We employed the Adam optimizer with a maximum learning rate of 10^-3. A linear warmup strategy was applied to the learning rate, spanning the initial 10 epochs for gradual warmup. We utilize Binary Cross Entropy for the loss function. In our experimental setup, the embedding component is modeled using a single-layer MLP with 10 hidden neurons, while the classifier component is a two-layer MLP, each layer containing 10 hidden neurons. |
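For intuition on what the paper's equivariant layers respect, the underlying weight-space symmetry can be illustrated on a toy two-layer MLP: permuting the hidden neurons (rows of the first weight matrix together with the matching columns of the second) leaves the network's function unchanged. The sketch below is a generic illustration of this symmetry, not the paper's Transformer-specific construction:

```python
import numpy as np

# Toy illustration (not the paper's layers): permuting hidden neurons of a
# two-layer ReLU MLP leaves its input-output function unchanged. NFN-style
# architectures are built to be equivariant/invariant to such permutations.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 4)), rng.normal(size=10)   # hidden layer
W2, b2 = rng.normal(size=(3, 10)), rng.normal(size=3)    # output layer

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

perm = rng.permutation(10)          # a permutation of the hidden neurons
x = rng.normal(size=4)
y = mlp(x, W1, b1, W2, b2)
y_perm = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)
assert np.allclose(y, y_perm)       # same function, permuted weights
```

A network that predicts generalization from raw weights should give the same answer for both weight settings, which is exactly the invariance the paper's readout layers enforce.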
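The hyperparameter sweep described in the Experiment Setup row can be sketched as a Cartesian-product grid. The six varied hyperparameters and the four optimizer choices are from the paper; the concrete values below are placeholders, since this report does not reproduce the paper's Table 4:

```python
from itertools import product

# Hypothetical grid in the spirit of the paper's Table 4. Only the six
# hyperparameter names and the optimizer options come from the paper;
# the numeric values are assumed for illustration.
grid = {
    "train_fraction": [0.25, 0.5, 1.0],               # assumed values
    "optimizer": ["SGD", "SGDm", "Adam", "RMSprop"],  # from the paper
    "learning_rate": [1e-2, 1e-3, 1e-4],              # assumed values
    "l2_coeff": [0.0, 1e-4],                          # assumed values
    "init_std": [0.02, 0.1],                          # assumed values
    "dropout": [0.0, 0.1],                            # assumed values
}

# One config dict per point in the product of all value lists.
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
print(len(configs))  # 3*4*3*2*2*2 = 288 for this toy grid
```

The paper's actual grid is larger (8,000 configurations per category, 16,000 in total), but the enumeration pattern is the same.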
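The predictor's learning-rate schedule (linear warmup to 10^-3 over the first 10 of 50 epochs) could look like the following sketch; holding the rate constant after warmup is an assumption, since the paper excerpt describes no decay:

```python
def lr_at_epoch(epoch, max_lr=1e-3, warmup_epochs=10, total_epochs=50):
    """Linear warmup to max_lr over the first warmup_epochs, then constant.

    The warmup length and max_lr follow the paper's training details; the
    constant rate after warmup is an assumption (no decay is described).
    """
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs
    return max_lr
```

For example, `lr_at_epoch(0)` gives one tenth of the maximum rate and `lr_at_epoch(9)` reaches the full 10^-3.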