Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints. This section presents an empirical comparison of our proposed model DIFFEQFORMER and various baseline models in the context of language modeling tasks. |
| Researcher Affiliation | Collaboration | Anh Tong (Korea University); Thanh Nguyen-Tang (Johns Hopkins University); Dongeun Lee (Texas A&M University-Commerce); Duc Nguyen (Qualcomm AI Research); Toan Tran (Qualcomm AI Research); David Hall (Stanford); Cheongwoong Kang (KAIST); Jaesik Choi (KAIST, INEEJI) |
| Pseudocode | No | The paper describes methods and equations in prose and mathematical notation (e.g., equations 2, 3, 4, 5, 6) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code of this work is available at https://github.com/SDML-KU/qkvflow. |
| Open Datasets | Yes | We evaluate our model DIFFEQFORMER in comparison to baselines on both OPENWEBTEXT (Gokaslan et al., 2019) and WIKITEXT103 (Merity et al., 2016) for the autoregressive modeling task. The main evaluation metric is perplexity, a commonly used measure in this task. We utilized the Wikitext dataset, accessible via the Hugging Face dataset ID: dlwh/wikitext_103_detokenized. Following the configuration from Dao et al. (2022), we partitioned the entire Open Web Text dataset into training and testing sets. |
| Dataset Splits | Yes | Following the configuration from Dao et al. (2022), we partitioned the entire Open Web Text dataset into training and testing sets. The test set corresponds to 0.0005 of the entire dataset. |
| Hardware Specification | Yes | We mainly use two A100 80GB GPUs for training our models. However, owing to hardware constraints, there are variations in the batch sizes used for training GPT-large, GPT-medium and GPT-small on the OPENWEBTEXT dataset. Specifically, the batch size for GPT-medium and GPT-large is set to 256, while the batch size for GPT-small is configured at 512. Meanwhile, the batch size for training on the WIKITEXT103 dataset is set at 256 for all models. For experiments with the Llama-1B configuration, training was conducted using eight NVIDIA H100 GPUs, each with 80GB of memory. |
| Software Dependencies | No | The implementation of our model and baseline models is built on JAX (Bradbury et al., 2018), utilizing an ecosystem that includes Equinox (Kidger & Garcia, 2021), Haliax, and the Levanter framework (Hall et al., 2023). While several software components are mentioned, specific version numbers are not provided for JAX, Equinox, Haliax, or Levanter, which is required for reproducibility. |
| Experiment Setup | Yes | All models were trained using the Adam optimizer, along with a weight decay of 0.1 and dropout rate of 0.1. For GPT-small and GPT-medium models, we used Adam's default hyperparameters (β1 = 0.9, β2 = 0.999). Based on our findings in Section C.5, we modified these parameters for GPT-large and Llama-1B models, setting β1 = 0.9, β2 = 0.95. The number of warm-up steps was set to 1% of the total training steps. Additionally, a cosine learning rate schedule was used. The ratio between the minimum learning rate and the base learning rate was fixed at 0.1. [...] However, owing to hardware constraints, there are variations in the batch sizes used for training GPT-large, GPT-medium and GPT-small on the OPENWEBTEXT dataset. Specifically, the batch size for GPT-medium and GPT-large is set to 256, while the batch size for GPT-small is configured at 512. Meanwhile, the batch size for training on the WIKITEXT103 dataset is set at 256 for all models. |
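The reported train/test partition (a test share of 0.0005 of OpenWebText, following Dao et al., 2022) can be sketched as a deterministic split over document indices. The document count and seed below are illustrative assumptions, not values from the paper:

```python
import random

def split_indices(n_docs: int, test_fraction: float = 0.0005, seed: int = 0):
    """Deterministically split document indices into train/test lists.

    Sketch only: the paper follows the Dao et al. (2022) configuration;
    0.0005 is the reported test fraction. Seed and doc count are assumptions.
    """
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)  # fixed seed -> reproducible split
    n_test = max(1, int(n_docs * test_fraction))
    return idx[n_test:], idx[:n_test]

# OpenWebText contains roughly 8M documents (approximate, for illustration)
train, test = split_indices(8_000_000)
```

With 8M documents, a 0.0005 fraction yields a 4,000-document test set.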
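The learning-rate recipe in the Experiment Setup row (linear warm-up over 1% of total steps, then cosine decay to 0.1 × the base rate) can be sketched as a plain schedule function, mirroring what e.g. Optax's `warmup_cosine_decay_schedule` provides. The base learning rate of 6e-4 is an illustrative assumption; the table does not state per-model base rates:

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, base_lr: float = 6e-4,
                     warmup_frac: float = 0.01, min_ratio: float = 0.1) -> float:
    """Sketch of the reported schedule: 1% linear warm-up, cosine decay,
    minimum LR fixed at 0.1 * base LR. base_lr=6e-4 is an assumed value."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    min_lr = min_ratio * base_lr
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp from 0 to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at the end of warm-up it equals the base rate, and at the final step it reaches the 0.1 × base-rate floor.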
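Perplexity, the main evaluation metric cited in the Open Datasets row, is the exponential of the mean per-token negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model that assigns uniform probability 1/4 to every token has
# per-token NLL of log(4), hence perplexity 4.0.
perplexity([math.log(4)] * 10)  # -> 4.0
```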