Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Authors: Ayush Kaushal, Tejas Vaidhya, Arnab Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts for a given bit size across various benchmarks. Notably, the 3.9B-parameter TriLM matches the performance of the FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. |
| Researcher Affiliation | Collaboration | Ayush Kaushal (1,2), Tejas Vaidhya (1,4), Arnab Kumar Mondal (4), Tejas Pandey (1,3), Aaryan Bhagat (5), Irina Rish (1,2,4); 1: Nolano AI, 2: University of Montreal, 3: IIT Kharagpur, 4: Mila Quebec AI Institute, 5: UC Riverside |
| Pseudocode | Yes | Figure 4: The computational flow of the forward, backward, and inference processes in the TriLM's linear layer with N-way model parallelism is shown on the left. Additionally, we provide the equations (on the left) for the forward pass during the training of our FloatLM, TriLM, and BiLM (for details, see Table 1). A.1 FORWARD PASS, BACKWARD PASS AND INFERENCE EQUATIONS: Table 1 shows the equations for TriLM vs. FloatLM for the forward pass, backward pass, and inference. |
| Open Source Code | No | The paper states: "We present Spectra, the first open suite of LLMs spanning many bit-widths." and "Training data as well as intermediate training checkpoints of TriLMs and FloatLMs are publicly available for future research." While the models and checkpoints are open, there is no explicit statement or link confirming the release of the source code for the methodology described in the paper. |
| Open Datasets | Yes | All the TriLMs and FloatLMs are trained on identical data sequences, specifically a 300B-token subset of the SlimPajama (Soboleva et al., 2023) dataset (see Appendix A.2)... We also make this subset public. |
| Dataset Splits | No | The paper states: "Main experiments (Spectra suite): We used the full 300B token sample." and "Ablation studies: Training runs with 100B tokens, we sample from these 300B tokens with equal probability weight to each data-point." It mentions a validation loss, but does not provide specific details on how the dataset was split into training, validation, or testing sets, nor the sizes or percentages of these splits. |
| Hardware Specification | Yes | We train on nodes with IBM POWER9 CPUs and 6x16GB V100 GPUs. |
| Software Dependencies | No | The paper mentions using "AdamW (Kingma & Ba, 2017) for optimization" and that "Our implementation was based on the GPT-NeoX codebase (Andonian et al., 2023)." It also states, "We extensively use Huggingface (Wolf et al., 2020) and Wandb (Biewald, 2020) for handling the checkpoints and experiment tracking." However, it does not provide specific version numbers for any of these software components or programming languages used. |
| Experiment Setup | Yes | Table 3 shows the hyperparameters for the TriLM and FloatLM transformer architectures and their learning rates. Adam β values are set to (0.9, 0.95) for both families of models, and all reported runs are trained at a sequence length of 2048. FloatLM and TriLM are trained with batch sizes of 2M and 1M tokens, respectively. Our optimization schedule for TriLM incorporates two key interventions within a standard linear-decay learning-rate schedule with warmup and weight decay (L2 regularization). First, we reduce the peak learning rate at approximately the halfway point of training. Second, we remove the weight decay regularization about two-thirds into training. |
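The Pseudocode row above summarizes forward-pass equations for a linear layer whose weights are quantized to {-1, 0, +1} with a shared scale. A minimal NumPy sketch of such a ternarized forward pass is below; the `ternarize` helper, the absmean scale, and the epsilon guard are illustrative assumptions of a common ternary-quantization recipe, not the paper's exact Table 1 equations.

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} with one
    per-matrix scale. The absmean scale used here is a common
    choice; the paper's exact scheme is given in its Table 1."""
    gamma = np.abs(w).mean()                      # shared scale
    w_t = np.clip(np.round(w / (gamma + 1e-8)), -1, 1)
    return w_t.astype(np.int8), gamma

def ternary_linear(x: np.ndarray, w: np.ndarray, b=None):
    """Forward pass with ternarized weights: y = gamma * (x @ w_t^T) + b.
    The int8 w_t is what would be stored at inference (~1.58 bits/weight)."""
    w_t, gamma = ternarize(w)
    y = gamma * (x @ w_t.astype(x.dtype).T)
    if b is not None:
        y = y + b
    return y
```

During training, schemes like this typically keep float "latent" weights and pass gradients through the quantizer with a straight-through estimator; only the inference path stores the ternary matrix.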
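The two schedule interventions in the Experiment Setup row (a peak learning-rate reduction around the halfway point, and weight-decay removal about two-thirds in) can be sketched as a per-step schedule function. The halving factor, the 1% warmup fraction, and the linear decay-to-zero shape are assumptions for illustration; the paper does not quote these exact values here.

```python
def trilm_schedule(step, total_steps, peak_lr=1e-3, warmup=0.01, wd=0.1):
    """Sketch of a linear warmup + linear decay schedule with the two
    TriLM interventions described in the report: (1) reduce the peak
    learning rate at ~50% of training, (2) drop weight decay at ~2/3.
    Returns (learning_rate, weight_decay) for the given step."""
    warmup_steps = int(warmup * total_steps)
    if step < warmup_steps:
        lr = peak_lr * step / max(1, warmup_steps)      # linear warmup
    else:
        frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = peak_lr * (1.0 - frac)                      # linear decay
    if step >= total_steps // 2:
        lr *= 0.5          # intervention 1: cut LR at ~halfway (factor assumed)
    weight_decay = 0.0 if step >= (2 * total_steps) // 3 else wd
    return lr, weight_decay                              # intervention 2 above
```

In practice these values would be fed into AdamW each step; decoupled weight decay makes intervention 2 a one-line change rather than a loss-function edit.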