QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Mahdi Nikdan, Dan Alistarh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. |
| Researcher Affiliation | Collaboration | ¹ISTA, ²Red Hat AI. Correspondence to: Dan Alistarh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 QuEST Training Forward; Algorithm 2 QuEST Training Backward |
| Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/QuEST. |
| Open Datasets | Yes | We trained all models on tokens from the C4 (Dodge et al., 2021) dataset, tokenized with the Llama 2 tokenizer. |
| Dataset Splits | No | The paper mentions using the C4 dataset for training and shows 'C4 Val Loss', but it does not provide specific details on how the dataset was split into training, validation, or test sets (e.g., percentages, sample counts, or methodology). |
| Hardware Specification | Yes | on a single RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using 'PyTorch (Paszke et al., 2019)' and the 'AdamW (Loshchilov & Hutter, 2019) optimizer', but it does not specify explicit version numbers for these software libraries or any other dependencies. |
| Experiment Setup | Yes | We used the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning rate schedule and a 10% warmup period, with gradient clipping (1.0 threshold, decoupled weight decay of 0.1). ... Table 4 describes size-specific models and optimizer hyper-parameters for all model sizes used in this work. |
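For readers reproducing the setup row above, the quoted schedule (cosine decay with a 10% linear warmup) can be sketched as a plain function. This is a minimal illustration, not the authors' code; the function name and the `peak_lr`/`min_lr` arguments are hypothetical placeholders, since the paper's per-model-size values live in its Table 4.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup.

    warmup_frac=0.10 mirrors the 10% warmup period quoted from the paper;
    peak_lr and min_lr are illustrative placeholders (the paper's Table 4
    lists size-specific hyper-parameters).
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup period.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this shape is typically realized with a `LambdaLR` scheduler wrapping the AdamW optimizer, with gradient clipping applied via `torch.nn.utils.clip_grad_norm_` at the quoted 1.0 threshold.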