On Exact Bit-level Reversible Transformers Without Changing Architecture
Authors: Guoqiang Zhang, Jp Lewis, W. Bastiaan Kleijn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in natural language generation, image classification, and language translation show that BDIA-transformers outperform their conventional counterparts significantly in terms of validation performance while also requiring considerably less training memory. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, University of Exeter, UK 2NVIDIA, USA 3School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. |
| Pseudocode | No | The paper describes methods using mathematical equations and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source-code can be found via this link. Four open-source repositories were used in the experiments (see Table 7 in Appendix G). |
| Open Datasets | Yes | We consider fully fine-tuning GPT2 medium ... by using the E2E dataset (Novikova et al., 2017). ...we trained BDIA-ViT ... on CIFAR10 and CIFAR100... The dataset being used is from Kaggle (Kelly, 2020). ...train BDIA-GPT2 on the openwebtext dataset. |
| Dataset Splits | No | The paper mentions using specific datasets like E2E, CIFAR10, CIFAR100, and openwebtext, and notes the use of a "0.05% subset" from openwebtext. However, it does not explicitly provide the train/validation/test split percentages or sample counts for these datasets in the main text, nor does it cite specific predefined splits for reproduction. |
| Hardware Specification | Yes | In this experiment, we trained BDIA-ViT with K=6 transformer blocks on CIFAR10 and CIFAR100 by using a single 2080 Ti GPU. |
| Software Dependencies | No | The paper refers to using open-source repositories (Table 7) for implementing experiments, but it does not explicitly list key software components with their specific version numbers (e.g., Python, PyTorch, CUDA versions) within the text. |
| Experiment Setup | Yes | For comparison, we also fine-tune GPT2 directly and via the LoRA technique with the default setup of (rank, α) = (4, 32). ...we utilized the SET-Adam optimizer (Zhang, 2024) in the training process with the configuration (η0, β1, β2, ϵ) = (1e-4, 0.9, 0.999, 1e-18), where η0 denotes the initial learning rate. The dropout rate was set to 0.1 to reduce over-fitting. ...The peak memory includes both the model parameters and the training states for a batch size of 128. ...The tested BDIA-transformer has six transformer blocks in both the encoder and decoder. |
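The hyperparameters quoted in the Experiment Setup row can be gathered into one configuration sketch. The paper does not show SET-Adam's actual interface, so plain dictionaries with assumed, PyTorch-style key names stand in; this is a minimal sketch of the reported values, not the authors' code:

```python
# Hedged sketch of the reported training configuration.
# Key names are assumptions; only the numeric values come from the paper.

# SET-Adam (Zhang, 2024) configuration: (eta_0, beta_1, beta_2, eps)
optimizer_config = {
    "lr": 1e-4,             # eta_0, initial learning rate
    "betas": (0.9, 0.999),  # beta_1, beta_2
    "eps": 1e-18,           # epsilon
}

# Default LoRA setup used for the GPT2 fine-tuning comparison
lora_config = {
    "rank": 4,    # LoRA rank r
    "alpha": 32,  # LoRA scaling factor
}

# Remaining reported settings
train_config = {
    "dropout": 0.1,     # to reduce over-fitting
    "batch_size": 128,  # batch size used for the peak-memory measurement
    "num_blocks": 6,    # transformer blocks in both encoder and decoder
}
```

The dictionaries are only a compact restatement of the row above; mapping them onto a real optimizer or LoRA implementation would depend on the repositories listed in Table 7 of the paper.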