COAT: Compressing Optimizer states and Activations for Memory-Efficient FP8 Training
Authors: Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that COAT effectively reduces end-to-end training memory footprint by 1.54× compared to BF16 while achieving nearly lossless performance across various tasks, such as Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43× end-to-end training speedup compared to BF16, performing on par with or surpassing Transformer Engine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. |
| Researcher Affiliation | Collaboration | 1 University of California, Berkeley 2 NVIDIA 3 MIT 4 Tsinghua University |
| Pseudocode | No | The paper includes mathematical equations for optimizer update rules and quantization, and describes methods in text, but does not present a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps formatted like code. |
| Open Source Code | Yes | The code is available at https://github.com/NVlabs/COAT |
| Open Datasets | Yes | For LLM pretraining, we report the perplexity on Wikitext-103 (Merity et al., 2016), C4 (Raffel et al., 2020), and Pile (Gao et al., 2020), and the accuracy on COPA (Gordon et al., 2012), ARC (Clark et al., 2018), SciQ (Welbl et al., 2017), and HellaSwag (Zellers et al., 2019). For LLM fine-tuning, we conduct experiments on a math corpus, and evaluate on Mathematics (Davies et al., 2021), SVAMP (Patel et al., 2021), NumGLUE (Mishra et al., 2022), and GSM8K (Cobbe et al., 2021). For VLM training, we report the score on Video-MME (Fu et al., 2024), POPE (Li et al., 2023b), VizWiz (Gurari et al., 2018), GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), TextVQA (Singh et al., 2019), SEED (Li et al., 2023a), and the MMMU Validation Set (Yue et al., 2024). We train OLMo-1B (Groeneveld et al., 2024) and OLMo-7B on Dolma (Soldaini et al., 2024). |
| Dataset Splits | No | The paper mentions various well-known datasets and benchmarks, and specifies training configurations such as pretraining for 300B tokens or fine-tuning for 3 epochs. However, it does not explicitly provide specific training/test/validation split percentages, sample counts, or explicit citations for the exact splits used for reproduction, instead relying on the implicit standard splits of the referenced datasets. |
| Hardware Specification | Yes | With the advent of Nvidia's H100 GPU (NVIDIA, 2024a), FP8 training (Micikevicius et al., 2022) is emerging as the next-generation low-precision technique. (...) (c) End-to-end per-GPU memory comparison when training Llama-2-13B on 8× 80GB H100 GPUs using FSDP. |
| Software Dependencies | No | We use PyTorch FSDP in our experiments. (...) The COAT LLaMA implementation introduces three new modules, which are not present in the Hugging Face implementation (...) The linear layers utilize Triton to implement custom FP8 matrix multiplication kernels. The paper mentions software components like PyTorch FSDP, Hugging Face, and Triton, but does not specify their version numbers, which are crucial for reproducibility. |
| Experiment Setup | Yes | For all experiments, we adopt the default hyperparameters in the official training recipe. (...) For OLMo-1B, we conduct pretraining for 300B tokens, which corresponds to 75k training steps. (...) Following the official report, we use a global batch size of 4M tokens (a macro batch size of 2048, with a sequence length of 2048 tokens). (...) We use 1×128 per-group quantization for optimizer states and 1×16 per-group quantization for non-linear layer activations. (...) We train for 3 epochs, and report the downstream task performance in Table 5. (...) and set the global batch size to 1024. We pad the sequence length to a multiple of 4 for efficiency consideration. |
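The per-group quantization mentioned in the setup (1×128 groups for optimizer states, 1×16 groups for activations) can be sketched as follows. This is an illustrative pure-Python simulation, not the paper's Triton kernels: each contiguous group of values shares one FP32 scale chosen so the group's maximum magnitude maps to the FP8 E4M3 maximum (448), and E4M3 rounding is approximated by keeping 3 explicit mantissa bits (subnormals are ignored). The function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def fp8_e4m3_round(v: float) -> float:
    """Round-to-nearest simulation of one FP8 E4M3 value (3 mantissa bits)."""
    v = max(-E4M3_MAX, min(E4M3_MAX, v))
    mant, exp = math.frexp(v)            # v = mant * 2**exp, |mant| in [0.5, 1)
    mant = round(mant * 16.0) / 16.0     # keep 3 explicit mantissa bits
    return math.ldexp(mant, exp)

def quantize_per_group(x, group_size=16):
    """One FP32 scale per contiguous group; the group max maps to E4M3_MAX."""
    q, scales = [], []
    for i in range(0, len(x), group_size):
        group = x[i:i + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        q.extend(fp8_e4m3_round(v / scale) for v in group)
    return q, scales

def dequantize_per_group(q, scales, group_size=16):
    """Invert the per-group scaling to recover approximate FP32 values."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

Finer groups (1×16 for activations) track the local dynamic range more tightly than per-tensor scaling, at the cost of storing more scales, which is consistent with the paper using smaller groups for activations than for optimizer states.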