BackSlash: Rate Constrained Optimized Training of Large Language Models

Authors: Jun Wu, Jiangtao Wen, Yuxing Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across various architectures and tasks demonstrate that BackSlash can reduce memory usage by 60%-90% without accuracy loss and yields significant compression gains compared to compression after training.
Researcher Affiliation | Academia | 1: Shenzhen International Graduate School, Tsinghua University; 2: Computer Science, New York University. Correspondence to: Yuxing Han <EMAIL>, Jiangtao Wen (project lead) <EMAIL>.
Pseudocode | Yes | Algorithm 1: Rate-Constrained Training (BackSlash) ... Algorithm 2: Parameter Entropy Encoding
Open Source Code | No | The paper does not provide any specific links to source code repositories or explicit statements about code availability.
Open Datasets | Yes | We perform various classification tasks on popular LLMs including BERT, GPT, Llama, and Gemma to evaluate the performance of BackSlash by classification accuracy, and generation tasks on DeepSeek evaluated by next-token accuracy. ... In Table 5, we perform additional classification tasks on the BERT model and generation tasks on the DeepSeek model under normal training and BackSlash. Sentiment and Spam are both binary classification tasks, and Topic is a 20-class classification task; all are evaluated by classification accuracy. Q-A and Translation are both text-generation tasks... Task and dataset pairs: Sentiment (IMDB), Spam (Enron-Spam), Topic (20 Newsgroups), Q-A (SQuAD), Translation (WMT-19).
Dataset Splits | No | Fig. 5 demonstrates the effect of BackSlash on model accuracy. We find that BackSlash with a reasonable λ did not have a significant effect on accuracy: for the model with λ = 1000, performance decreased by only 0.02% on the training set and 1.90% on the test set compared with normal training (i.e., λ = 0). While the paper mentions a 'training set' and 'test set', it does not specify the split ratios, methodology, or sample counts needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments.
Software Dependencies | No | The paper does not provide any ancillary software details with version numbers.
Experiment Setup | No | Algorithm 1 Rate-Constrained Training (BackSlash) 1: Require: Model f, learning rate η, loss function L, Lagrange multiplier λ, and clipping coefficient ϵ. ... For the model with λ = 1000, performance decreased by only 0.02% on the training set and 1.90% on the test set compared with normal training (i.e., λ = 0). While the algorithm lists requirements such as 'learning rate η' and 'clipping coefficient ϵ', the specific values used in the experiments are not provided. The paper varies the Lagrange multiplier λ but does not present a comprehensive set of hyperparameters for reproducibility.
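The Algorithm 1 requirements quoted above (model f, learning rate η, loss function L, Lagrange multiplier λ, clipping coefficient ϵ) describe a Lagrangian rate-distortion objective: each gradient step minimizes the task loss plus λ times a measure of the parameters' bit-rate. The sketch below is a hypothetical reconstruction for illustration, not the authors' code: it assumes a least-squares task loss in place of an LLM loss, uses a log-magnitude surrogate for the parameter bit-rate (the paper's actual rate model is not reproduced here), and interprets the clipping coefficient as an elementwise gradient clip.

```python
import numpy as np

def rate_proxy(w, eps=1e-3):
    """Differentiable surrogate for the parameter bit-rate.

    Illustrative choice only: bits grow logarithmically with weight
    magnitude, so the penalty drives weights toward cheap (small) values.
    """
    return np.sum(np.log2(1.0 + np.abs(w) / eps))

def rate_grad(w, eps=1e-3):
    """Analytic gradient of rate_proxy with respect to w."""
    return np.sign(w) / ((eps + np.abs(w)) * np.log(2.0))

def backslash_train(X, y, lam, eta=0.01, steps=500, clip=1.0, eps=1e-3):
    """Sketch of rate-constrained training on a toy least-squares task.

    Minimizes  L(w) + lam * R(w)  by gradient descent, where L is the
    mean-squared residual and R is the rate surrogate above.  `clip` is
    a stand-in for the paper's clipping coefficient (assumed semantics).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        resid = X @ w - y
        g = 2.0 * X.T @ resid / len(y) + lam * rate_grad(w, eps)
        g = np.clip(g, -clip, clip)  # elementwise gradient clipping
        w -= eta * g
    return w
```

Running the sketch with λ = 0 recovers ordinary training, while a positive λ trades a little task accuracy for a much lower parameter bit-rate, mirroring the trade-off the report quotes from Fig. 5.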