On Exact Bit-level Reversible Transformers Without Changing Architecture

Authors: Guoqiang Zhang, J.P. Lewis, W. Bastiaan Kleijn

ICML 2025

Reproducibility: Variable | Result | LLM Response
Research Type Experimental Experimental results for natural language generation (NLG), image classification, and language translation show that the BDIA technique significantly improves the validation performance over that of the corresponding baseline transformers and simultaneously reduces training memory significantly.
Researcher Affiliation Collaboration 1Department of Computer Science, University of Exeter, UK; 2NVIDIA, USA; 3School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.
Pseudocode No The paper describes methods using mathematical equations and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Source-code can be found via this link. Four open-source repositories were used in the experiments (see Table 7 in Appendix G).
Open Datasets Yes We consider fully fine-tuning GPT2 medium ... by using the E2E dataset (Novikova et al., 2017). ...we trained BDIA-ViT ... on CIFAR10 and CIFAR100... The dataset being used is from Kaggle (Kelly, 2020). ...train BDIA-GPT2 on the openwebtext dataset.
Dataset Splits No The paper mentions using specific datasets like E2E, CIFAR10, CIFAR100, and openwebtext, and notes the use of a "0.05% subset" from openwebtext. However, it does not explicitly provide the train/validation/test split percentages or sample counts for these datasets in the main text, nor does it cite specific predefined splits for reproduction.
Hardware Specification Yes In this experiment, we trained BDIA-ViT with K=6 transformer blocks on CIFAR10 and CIFAR100 by using a single 2080 Ti GPU.
Software Dependencies No The paper refers to using open-source repositories (Table 7) for implementing experiments, but it does not explicitly list key software components with their specific version numbers (e.g., Python, PyTorch, CUDA versions) within the text.
Experiment Setup Yes For comparison, we also fine-tune GPT2 directly and via the LoRA technique with the default setup of (rank, α) = (4, 32). ...we utilized the SET-Adam optimizer (Zhang, 2024) in the training process with the configuration (η0, β1, β2, ϵ) = (1e-4, 0.9, 0.999, 1e-18), where η0 denotes the initial learning rate. The dropout rate was set to 0.1 to reduce over-fitting. ...The peak memory includes both the model parameters and the training states for a batch size of 128. ...The tested BDIA-transformer has six transformer blocks in both the encoder and the decoder.
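The quoted setup can be collected into a short configuration sketch. SET-Adam (Zhang, 2024) is not a standard-library optimizer, so the dict below merely records its quoted hyperparameters rather than implementing its update rule; the α/r scaling is the standard LoRA convention and is an assumption about how the repository applies (rank, α) = (4, 32):

```python
# Hyperparameters quoted in the paper's experiment setup.
lora_cfg = {"rank": 4, "alpha": 32}   # default LoRA setup (rank, alpha)
optimizer_cfg = {                     # SET-Adam configuration as quoted;
    "lr": 1e-4,                       #   eta_0, initial learning rate
    "betas": (0.9, 0.999),            #   (beta_1, beta_2)
    "eps": 1e-18,                     #   epsilon
}
dropout_rate = 0.1                    # to reduce over-fitting
batch_size = 128                      # used for the peak-memory figures
num_blocks = 6                        # transformer blocks per encoder/decoder

# Standard LoRA convention: the low-rank update B @ A is scaled by alpha / rank
# before being added to the frozen pretrained weight.
lora_scaling = lora_cfg["alpha"] / lora_cfg["rank"]
print(lora_scaling)  # 8.0
```

With (rank, α) = (4, 32) this gives a scaling factor of 8.0, which would multiply the low-rank update applied to each adapted weight matrix during fine-tuning.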