QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Mahdi Nikdan, Dan Alistarh

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently.
Researcher Affiliation | Collaboration | 1 ISTA, 2 Red Hat AI. Correspondence to: Dan Alistarh <EMAIL>.
Pseudocode | Yes | Algorithm 1 (QuEST Training Forward); Algorithm 2 (QuEST Training Backward)
Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/QuEST.
Open Datasets | Yes | We trained all models on tokens from the C4 dataset (Dodge et al., 2021), tokenized with the Llama 2 tokenizer.
Dataset Splits | No | The paper mentions training on the C4 dataset and reports 'C4 Val Loss', but it does not specify how the dataset was split into training, validation, or test sets (e.g., percentages, sample counts, or methodology).
Hardware Specification | Yes | "on a single RTX 4090 GPU."
Software Dependencies | No | The paper mentions using PyTorch (Paszke et al., 2019) and the AdamW (Loshchilov & Hutter, 2019) optimizer, but it does not give explicit version numbers for these libraries or for any other dependencies.
Experiment Setup | Yes | We used the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning-rate schedule and a 10% warmup period, with gradient clipping (1.0 threshold) and decoupled weight decay of 0.1. ... Table 4 describes size-specific models and optimizer hyper-parameters for all model sizes used in this work.
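The reported setup (AdamW, cosine schedule with 10% warmup, gradient clipping at 1.0, decoupled weight decay of 0.1) can be sketched in PyTorch roughly as below. This is a minimal illustration, not the paper's actual training loop: the model, learning rate, and step count are placeholders, since the size-specific values live in the paper's Table 4.

```python
import math
import torch

# Placeholder stand-in for a Llama-type model; the real architectures and
# per-size hyper-parameters are given in Table 4 of the paper.
model = torch.nn.Linear(16, 16)

total_steps = 100                       # placeholder; not from the paper
warmup_steps = int(0.1 * total_steps)   # 10% warmup period

# Decoupled weight decay of 0.1, as described in the setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup for the first 10% of steps, cosine decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    # Clip the global gradient norm at the 1.0 threshold.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
```

The warmup/cosine shape is implemented via `LambdaLR`, which multiplies the base learning rate by `lr_lambda(step)`; other schedulers (e.g. a built-in cosine annealing plus a separate warmup phase) would be equally valid readings of the paper's description.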