Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints. This section presents an empirical comparison of our proposed model DIFFEQFORMER and various baseline models in the context of language modeling tasks. |
| Researcher Affiliation | Collaboration | Anh Tong (Korea University); Thanh Nguyen-Tang (Johns Hopkins University); Dongeun Lee (Texas A&M University-Commerce); Duc Nguyen (Qualcomm AI Research); Toan Tran (Qualcomm AI Research); David Hall (Stanford); Cheongwoong Kang (KAIST); Jaesik Choi (KAIST, INEEJI) |
| Pseudocode | No | The paper describes methods and equations in prose and mathematical notation (e.g., equations 2, 3, 4, 5, 6) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code of this work is available at https://github.com/SDML-KU/qkvflow. |
| Open Datasets | Yes | We evaluate our model DIFFEQFORMER in comparison to baselines on both OPENWEBTEXT (Gokaslan et al., 2019) and WIKITEXT103 (Merity et al., 2016) for the autoregressive modeling task. The main evaluation metric is perplexity, a commonly used measure in this task. We utilized the Wikitext dataset, accessible via the Hugging Face dataset ID: dlwh/wikitext_103_detokenized. Following the configuration from Dao et al. (2022), we partitioned the entire Open Web Text dataset into training and testing sets. |
| Dataset Splits | Yes | Following the configuration from Dao et al. (2022), we partitioned the entire Open Web Text dataset into training and testing sets. The test set corresponds to 0.0005 of the entire dataset. |
| Hardware Specification | Yes | We mainly use two A100 80GB GPUs for training our models. However, owing to hardware constraints, there are variations in the batch sizes used for training GPT-large, GPT-medium and GPT-small on the OPENWEBTEXT dataset. Specifically, the batch size for GPT-medium and GPT-large is set to 256, while the batch size for GPT-small is configured at 512. Meanwhile, the batch size for training on the WIKITEXT103 dataset is set at 256 for all models. For experiments with the Llama-1B configuration, training was conducted using eight NVIDIA H100 GPUs, each with 80GB of memory. |
| Software Dependencies | No | The implementation of our model and baseline models is built on JAX (Bradbury et al., 2018), utilizing an ecosystem that includes Equinox (Kidger & Garcia, 2021), Haliax, and the Levanter framework (Hall et al., 2023). While several software components are mentioned, specific version numbers are not provided for JAX, Equinox, Haliax, or Levanter, which is required for reproducibility. |
| Experiment Setup | Yes | All models were trained using the Adam optimizer, along with a weight decay of 0.1 and dropout rate of 0.1. For GPT-small and GPT-medium models, we used Adam's default hyperparameters (β1 = 0.9, β2 = 0.999). Based on our findings in Section C.5, we modified these parameters for GPT-large and Llama-1B models, setting β1 = 0.9, β2 = 0.95. The number of warm-up steps was set to 1% of the total training steps. Additionally, a cosine learning rate schedule was used. The ratio between the minimum learning rate and the base learning rate was fixed at 0.1. [...] However, owing to hardware constraints, there are variations in the batch sizes used for training GPT-large, GPT-medium and GPT-small on the OPENWEBTEXT dataset. Specifically, the batch size for GPT-medium and GPT-large is set to 256, while the batch size for GPT-small is configured at 512. Meanwhile, the batch size for training on the WIKITEXT103 dataset is set at 256 for all models. |
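The reported train/test partition (a test share of 0.0005 of OpenWebText, following Dao et al., 2022) can be sketched as a deterministic split over document indices. The document count and seed below are illustrative assumptions, not values from the paper:

```python
import random

def split_indices(n_docs: int, test_fraction: float = 0.0005, seed: int = 0):
    """Deterministically split document indices into train/test lists.

    Sketch only: the paper follows the Dao et al. (2022) configuration;
    0.0005 is the reported test fraction. Seed and doc count are assumptions.
    """
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)  # fixed seed -> reproducible split
    n_test = max(1, int(n_docs * test_fraction))
    return idx[n_test:], idx[:n_test]

# OpenWebText contains roughly 8M documents (approximate, for illustration)
train, test = split_indices(8_000_000)
```

With 8M documents, a 0.0005 fraction yields a 4,000-document test set.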
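The learning-rate recipe in the Experiment Setup row (linear warm-up over 1% of total steps, then cosine decay to 0.1 × the base rate) can be sketched as a plain schedule function, mirroring what e.g. Optax's `warmup_cosine_decay_schedule` provides. The base learning rate of 6e-4 is an illustrative assumption; the table does not state per-model base rates:

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, base_lr: float = 6e-4,
                     warmup_frac: float = 0.01, min_ratio: float = 0.1) -> float:
    """Sketch of the reported schedule: 1% linear warm-up, cosine decay,
    minimum LR fixed at 0.1 * base LR. base_lr=6e-4 is an assumed value."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    min_lr = min_ratio * base_lr
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp from 0 to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at the end of warm-up it equals the base rate, and at the final step it reaches the 0.1 × base-rate floor.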
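Perplexity, the main evaluation metric cited in the Open Datasets row, is the exponential of the mean per-token negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model that assigns uniform probability 1/4 to every token has
# per-token NLL of log(4), hence perplexity 4.0.
perplexity([math.log(4)] * 10)  # -> 4.0
```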