QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Mahdi Nikdan, Dan Alistarh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. |
| Researcher Affiliation | Collaboration | ¹ISTA, ²Red Hat AI. Correspondence to: Dan Alistarh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 QuEST Training Forward; Algorithm 2 QuEST Training Backward |
| Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/QuEST. |
| Open Datasets | Yes | We trained all models on tokens from the C4 (Dodge et al., 2021) dataset, tokenized with the Llama 2 tokenizer. |
| Dataset Splits | No | The paper mentions using the C4 dataset for training and shows 'C4 Val Loss', but it does not provide specific details on how the dataset was split into training, validation, or test sets (e.g., percentages, sample counts, or methodology). |
| Hardware Specification | Yes | on a single RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using 'PyTorch (Paszke et al., 2019)' and the 'AdamW (Loshchilov & Hutter, 2019) optimizer', but it does not specify explicit version numbers for these software libraries or any other dependencies. |
| Experiment Setup | Yes | We used the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning rate schedule and a 10% warmup period, with gradient clipping (1.0 threshold, decoupled weight decay of 0.1). ... Table 4 describes size-specific models and optimizer hyper-parameters for all model sizes used in this work. |
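For readers reproducing the setup row above, the quoted schedule (cosine decay with a 10% linear warmup) can be sketched as a plain function. This is a minimal illustration, not the authors' code; the function name and the `peak_lr`/`min_lr` arguments are hypothetical placeholders, since the paper's per-model-size values live in its Table 4.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup.

    warmup_frac=0.10 mirrors the 10% warmup period quoted from the paper;
    peak_lr and min_lr are illustrative placeholders (the paper's Table 4
    lists size-specific hyper-parameters).
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup period.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this shape is typically realized with a `LambdaLR` scheduler wrapping the AdamW optimizer, with gradient clipping applied via `torch.nn.utils.clip_grad_norm_` at the quoted 1.0 threshold.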