TicketLLM: Next-Generation Sparse and Low-bit Transformers with Supermask-based Method
Authors: Yasuyuki Okoshi, Hikari Otsuka, Daichi Fujiki, Masato Motomura
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Ada-Sup can discover high-quality supermasks with significantly reduced training costs compared to previous methods in both binary and multi-bit settings. Furthermore, TicketLLM outperforms BitNet b1.58 on a 1.3B parameter model with the same memory per connection, achieving 0.6% reduction in perplexity (from 13.62 to 13.54) while operating at a higher sparsity level (around 50% vs. around 33%). These results highlight the potential of supermask-based methods as a promising approach for building lightweight LLMs. Code is available: https://github.com/yasu0001/TicketLLM. ... 4 Evaluation |
| Researcher Affiliation | Academia | Yasuyuki Okoshi, Hikari Otsuka, Daichi Fujiki, Masato Motomura (AI Computing Research Unit, Institute of Science Tokyo) |
| Pseudocode | No | The paper describes methods through textual descriptions and mathematical equations (e.g., Eq. 1, 2, 4, 5, 6), and visual diagrams (Figure 3: Overview of supermask generation methods). However, it does not include a dedicated section or block explicitly labeled as "Pseudocode" or "Algorithm" with structured, code-like steps. |
| Open Source Code | Yes | Code is available: https://github.com/yasu0001/TicketLLM. |
| Open Datasets | Yes | Transformer models are trained on randomly sampled subsets from FineWeb-Edu (Penedo et al., 2024) and evaluated on the C4 validation dataset (Raffel et al., 2020). |
| Dataset Splits | Yes | Transformer models are trained on randomly sampled subsets from FineWeb-Edu (Penedo et al., 2024) and evaluated on the C4 validation dataset (Raffel et al., 2020). Both datasets are tokenized using the LLaMA2 tokenizer (Touvron et al., 2023), whose vocabulary size is 32K. In order to ensure consistent training, tokens are concatenated into sequences of length 2048, where shorter sequences are combined, and longer sequences are truncated. |
| Hardware Specification | Yes | Execution time is measured over 100 iterations using an NVIDIA GeForce RTX 3090. ... On 700M-parameter models trained with 20 TPPs, Ada-Sup takes a training time of approximately 40 H100 GPU hours for both 2-bit and 3-bit supermasks. ... This work was carried out using the TSUBAME4.0 supercomputer at Institute of Science Tokyo. |
| Software Dependencies | No | Execution time is measured using the native PyTorch implementation (without custom CUDA kernels or third-party optimizations). ... All training and evaluation experiments are conducted using LLM Foundry (MosaicML, 2023). |
| Experiment Setup | Yes | Scores are initialized using a normal distribution with a standard deviation of 0.02. We increase the number of training tokens for pre-training following a ratio of tokens per model parameters (TPP). Models are optimized with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2019), setting β1 = 0.95, β2 = 0.99, and a weight decay of 0.1. The maximum learning rate is scaled down with increasing parameters, according to Kaplan et al. (2020). It linearly decays to zero after the learning rate warms up in the first 1% of the total number of iterations. ... The batch size is 512, with gradient accumulation employed for larger models. Gradient clipping with 1.0 is also applied to stabilize training. All hyperparameters are summarized in Table 5. |
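The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 1% of iterations, then linear decay to zero) can be sketched as a plain function. This is an illustrative reconstruction of the described schedule, not the authors' code; the function name and step convention are assumptions:

```python
def lr_at_step(step: int, total_steps: int, max_lr: float) -> float:
    """Learning rate at a given optimizer step (0-indexed).

    Sketch of the schedule described in the paper's experiment setup:
    linear warmup to max_lr over the first 1% of iterations, then
    linear decay toward zero over the remaining iterations.
    """
    warmup_steps = max(1, int(0.01 * total_steps))
    if step < warmup_steps:
        # Warmup: ramp from 0 up to max_lr over the first 1% of steps.
        return max_lr * (step + 1) / warmup_steps
    # Decay: fall linearly from max_lr toward 0 over the rest of training.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * (1.0 - progress)
```

In a PyTorch setup matching the quoted hyperparameters, this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` around `torch.optim.AdamW(..., betas=(0.95, 0.99), weight_decay=0.1)`, with `torch.nn.utils.clip_grad_norm_(params, 1.0)` applied each step.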