Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Authors: Ayush Kaushal, Tejas Vaidhya, Arnab Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts for a given bit size across various benchmarks. Notably, the 3.9B-parameter TriLM matches the performance of the FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. |
| Researcher Affiliation | Collaboration | Ayush Kaushal (1,2), Tejas Vaidhya (1,4), Arnab Kumar Mondal (4), Tejas Pandey (1,3), Aaryan Bhagat (5), Irina Rish (1,2,4); 1: Nolano AI, 2: University of Montreal, 3: IIT Kharagpur, 4: Mila Quebec AI Institute, 5: UC Riverside |
| Pseudocode | Yes | Figure 4: The computational flow of the forward, backward, and inference processes in the TriLM's linear layer with N-way model parallelism is shown on the left. Additionally, we provide the equations (on the left) for the forward pass during the training of our FloatLM, TriLM, and BiLM (for details, see Table 1). A.1 FORWARD PASS, BACKWARD PASS AND INFERENCE EQUATIONS: Table 1 shows the equations for TriLM vs. FloatLM for the forward pass, backward pass, and inference. |
| Open Source Code | No | The paper states: "We present Spectra, the first open suite of LLMs spanning many bit-widths." and "Training data as well as intermediate training checkpoints of TriLMs and FloatLMs are publicly available for future research." While the models and checkpoints are open, there is no explicit statement or link confirming the release of the source code for the methodology described in the paper. |
| Open Datasets | Yes | All the TriLMs and FloatLMs are trained on identical data sequences, specifically a 300B-token subset of the SlimPajama (Soboleva et al., 2023) dataset (see Appendix A.2)... We also make this subset public. |
| Dataset Splits | No | The paper states: "Main experiments (Spectra suite): We used the full 300B token sample." and "Ablation studies: Training runs with 100B tokens, we sample from these 300B tokens with equal probability weight to each data-point." It mentions a validation loss, but does not provide specific details on how the dataset was split into training, validation, or testing sets, nor the sizes or percentages of these splits. |
| Hardware Specification | Yes | We train on nodes with IBM POWER9 CPUs and 6x16GB V100 GPUs. |
| Software Dependencies | No | The paper mentions using "AdamW (Kingma & Ba, 2017) for optimization" and that "Our implementation was based on the GPT-NeoX codebase (Andonian et al., 2023)." It also states, "We extensively use Huggingface (Wolf et al., 2020) and Wandb (Biewald, 2020) for handling the checkpoints and experiment tracking." However, it does not provide specific version numbers for any of these software components or programming languages used. |
| Experiment Setup | Yes | Table 3 shows the hyperparameters for the TriLM and FloatLM transformer architectures and their learning rates. Adam β values are set to (0.9, 0.95) for both families of models, and all reported runs are trained at a sequence length of 2048. FloatLM and TriLM are trained with batch sizes of 2M and 1M tokens, respectively. Our optimization schedule for TriLM incorporates two key interventions within a standard linear-decay learning-rate schedule with warmup and weight decay (L2 regularization). First, we reduce the peak learning rate at approximately the halfway point of training. Second, we remove the weight decay regularization about two-thirds into training. |
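The Pseudocode row above summarizes forward-pass equations for a linear layer whose weights are quantized to {-1, 0, +1} with a shared scale. A minimal NumPy sketch of such a ternarized forward pass is below; the `ternarize` helper, the absmean scale, and the epsilon guard are illustrative assumptions of a common ternary-quantization recipe, not the paper's exact Table 1 equations.

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} with one
    per-matrix scale. The absmean scale used here is a common
    choice; the paper's exact scheme is given in its Table 1."""
    gamma = np.abs(w).mean()                      # shared scale
    w_t = np.clip(np.round(w / (gamma + 1e-8)), -1, 1)
    return w_t.astype(np.int8), gamma

def ternary_linear(x: np.ndarray, w: np.ndarray, b=None):
    """Forward pass with ternarized weights: y = gamma * (x @ w_t^T) + b.
    The int8 w_t is what would be stored at inference (~1.58 bits/weight)."""
    w_t, gamma = ternarize(w)
    y = gamma * (x @ w_t.astype(x.dtype).T)
    if b is not None:
        y = y + b
    return y
```

During training, schemes like this typically keep float "latent" weights and pass gradients through the quantizer with a straight-through estimator; only the inference path stores the ternary matrix.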
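The two schedule interventions in the Experiment Setup row (a peak learning-rate reduction around the halfway point, and weight-decay removal about two-thirds in) can be sketched as a per-step schedule function. The halving factor, the 1% warmup fraction, and the linear decay-to-zero shape are assumptions for illustration; the paper does not quote these exact values here.

```python
def trilm_schedule(step, total_steps, peak_lr=1e-3, warmup=0.01, wd=0.1):
    """Sketch of a linear warmup + linear decay schedule with the two
    TriLM interventions described in the report: (1) reduce the peak
    learning rate at ~50% of training, (2) drop weight decay at ~2/3.
    Returns (learning_rate, weight_decay) for the given step."""
    warmup_steps = int(warmup * total_steps)
    if step < warmup_steps:
        lr = peak_lr * step / max(1, warmup_steps)      # linear warmup
    else:
        frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = peak_lr * (1.0 - frac)                      # linear decay
    if step >= total_steps // 2:
        lr *= 0.5          # intervention 1: cut LR at ~halfway (factor assumed)
    weight_decay = 0.0 if step >= (2 * total_steps) // 3 else wd
    return lr, weight_decay                              # intervention 2 above
```

In practice these values would be fed into AdamW each step; decoupled weight decay makes intervention 2 a one-line change rather than a loss-function edit.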