Ultra-Sparse Memory Network

Authors: Zihao Huang, Qiyang Min, Hongzhi Huang, Yutao Zeng, Defa Zhu, Ran Guo, Xun Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts. [...] 5 EXPERIMENTS
Researcher Affiliation | Industry | Seed-Foundation-Model Team, ByteDance
Pseudocode | No | The paper describes steps in regular paragraph text and provides flow diagrams (Figure 4, Figure 5) but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The text does not include an unambiguous statement about releasing code or a link to a source code repository for the methodology described in this paper.
Open Datasets | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens. RedPajama is a clean-room, fully open-source reproduction of the LLaMA (Touvron et al., 2023) dataset. Validation data includes the C4 validation set (Raffel et al., 2020), derived from the Common Crawl web corpus.
Dataset Splits | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens... Validation data includes the C4 validation set (Raffel et al., 2020)... The C4 training set is also incorporated within the RedPajama training data.
Hardware Specification | Yes | The experiments in (b) and (c) are conducted on the A100-SXM-80GB. [...] The models run on the A100-SXM.
Software Dependencies | No | The paper mentions software frameworks and models used (e.g., the GPT-NeoX tokenizer, Megatron) but does not provide specific version numbers for key software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | Training details. We used a standard pre-norm transformer... In MoE models, two experts are activated per token... using a balance loss... weight of 0.01... In UltraMem models, the auxiliary loss weight is α = 0.001 and the margin is τ = 0.15. The learning rate for values is ten times that of other parameters and decays linearly. For model structure and hyperparameter details, see Appendix E; for large-scale training optimizations, see Appendices C and D. Appendix E, Table 4, provides: weight decay 0.1, β1 0.9, β2 0.95, LR 6e-4/2.5e-4/2e-4/1.2e-4, LR end ratio 0.1, cosine LR schedule, LR warmup ratio 0.01, dropout 0.1, batch size 2048, sequence length 2048, training steps 238418. Table 5 provides: Tucker rank r 2, multi-core scoring h 2, virtual memory expansion E 4, aux loss weight α 0.001, aux loss margin τ 0.15.
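The reported schedule (cosine decay with warmup for most parameters, a 10x linearly decayed learning rate for value parameters) can be sketched as plain functions. This is a minimal sketch assuming a standard warmup-then-cosine formulation; the function names `cosine_lr` and `value_lr` are illustrative and not from the paper, and only the hyperparameter values (warmup ratio 0.01, end ratio 0.1, 238418 steps, base LR 6e-4) come from Table 4.

```python
import math

# Hyperparameters reported in Appendix E, Table 4.
TOTAL_STEPS = 238418
WARMUP_RATIO = 0.01   # fraction of steps used for linear warmup
LR_END_RATIO = 0.1    # final LR as a fraction of the base LR

def cosine_lr(step, base_lr=6e-4, total_steps=TOTAL_STEPS,
              warmup_ratio=WARMUP_RATIO, end_ratio=LR_END_RATIO):
    """Linear warmup, then cosine decay down to end_ratio * base_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = base_lr * end_ratio
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def value_lr(step, base_lr=6e-4, total_steps=TOTAL_STEPS):
    """Value parameters: ten times the base LR, decayed linearly to zero
    (a sketch of the rule stated in the training details)."""
    return 10 * base_lr * (1 - step / total_steps)
```

In a PyTorch-style setup this would correspond to two optimizer parameter groups, one for the memory values and one for everything else, each driven by its own schedule.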