TicketLLM: Next-Generation Sparse and Low-bit Transformers with Supermask-based Method
Authors: Yasuyuki Okoshi, Hikari Otsuka, Daichi Fujiki, Masato Motomura
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Ada-Sup can discover high-quality supermasks with significantly reduced training costs compared to previous methods in both binary and multi-bit settings. Furthermore, TicketLLM outperforms BitNet b1.58 on a 1.3B parameter model with the same memory per connection, achieving 0.6% reduction in perplexity (from 13.62 to 13.54) while operating at a higher sparsity level (around 50% vs. around 33%). These results highlight the potential of supermask-based methods as a promising approach for building lightweight LLMs. Code is available: https://github.com/yasu0001/TicketLLM. ... 4 Evaluation |
| Researcher Affiliation | Academia | Yasuyuki Okoshi, Hikari Otsuka, Daichi Fujiki, Masato Motomura (AI Computing Research Unit, Institute of Science Tokyo) |
| Pseudocode | No | The paper describes methods through textual descriptions and mathematical equations (e.g., Eq. 1, 2, 4, 5, 6), and visual diagrams (Figure 3: Overview of supermask generation methods). However, it does not include a dedicated section or block explicitly labeled as "Pseudocode" or "Algorithm" with structured, code-like steps. |
| Open Source Code | Yes | Code is available: https://github.com/yasu0001/TicketLLM. |
| Open Datasets | Yes | Transformer models are trained on randomly sampled subsets from FineWeb-Edu (Penedo et al., 2024) and evaluated on the C4 validation dataset (Raffel et al., 2020). |
| Dataset Splits | Yes | Transformer models are trained on randomly sampled subsets from FineWeb-Edu (Penedo et al., 2024) and evaluated on the C4 validation dataset (Raffel et al., 2020). Both datasets are tokenized using the LLaMA2 tokenizer (Touvron et al., 2023), whose vocabulary size is 32K. In order to ensure consistent training, tokens are concatenated into sequences of length 2048, where shorter sequences are combined, and longer sequences are truncated. |
| Hardware Specification | Yes | Execution time is measured over 100 iterations using an NVIDIA GeForce RTX 3090. ... On 700M-parameter models trained with 20 TPPs, Ada-Sup takes a training time of approximately 40 H100 GPU hours for both 2-bit and 3-bit supermasks. ... This work was carried out using the TSUBAME4.0 supercomputer at Institute of Science Tokyo. |
| Software Dependencies | No | Execution time is measured using the native PyTorch implementation (without custom CUDA kernels or third-party optimizations). ... All training and evaluation experiments are conducted using LLM Foundry (MosaicML, 2023). |
| Experiment Setup | Yes | Scores are initialized using a normal distribution with a standard deviation of 0.02. We increase the number of training tokens for pre-training following a ratio of tokens per model parameters (TPP). Models are optimized with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2019), setting β1 = 0.95, β2 = 0.99, and a weight decay of 0.1. The maximum learning rate is scaled down with increasing parameters, according to Kaplan et al. (2020). It linearly decays to zero after the learning rate warms up in the first 1% of the total number of iterations. ... The batch size is 512, with gradient accumulation employed for larger models. Gradient clipping with 1.0 is also applied to stabilize training. All hyperparameters are summarized in Table 5. |
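The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 1% of iterations, then linear decay to zero) can be sketched as a plain function. This is an illustrative reconstruction of the described schedule, not the authors' code; the function name and step convention are assumptions:

```python
def lr_at_step(step: int, total_steps: int, max_lr: float) -> float:
    """Learning rate at a given optimizer step (0-indexed).

    Sketch of the schedule described in the paper's experiment setup:
    linear warmup to max_lr over the first 1% of iterations, then
    linear decay toward zero over the remaining iterations.
    """
    warmup_steps = max(1, int(0.01 * total_steps))
    if step < warmup_steps:
        # Warmup: ramp from 0 up to max_lr over the first 1% of steps.
        return max_lr * (step + 1) / warmup_steps
    # Decay: fall linearly from max_lr toward 0 over the rest of training.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * (1.0 - progress)
```

In a PyTorch setup matching the quoted hyperparameters, this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` around `torch.optim.AdamW(..., betas=(0.95, 0.99), weight_decay=0.1)`, with `torch.nn.utils.clip_grad_norm_(params, 1.0)` applied each step.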