Ultra-Sparse Memory Network
Authors: Zihao Huang, Qiyang Min, Hongzhi Huang, Yutao Zeng, Defa Zhu, Ran Guo, Xun Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts. [...] (Section 5, EXPERIMENTS) |
| Researcher Affiliation | Industry | Seed-Foundation-Model Team, ByteDance |
| Pseudocode | No | The paper describes steps in regular paragraph text and provides flow diagrams (Figure 4, Figure 5) but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The text does not include an unambiguous statement about releasing code or a link to a source code repository for the methodology described in this paper. |
| Open Datasets | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens. RedPajama represents a clean-room, fully open-source version of the LLaMA (Touvron et al., 2023) dataset. Validation data includes the C4 validation set (Raffel et al., 2020), derived from the Common Crawl web corpus. |
| Dataset Splits | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens... Validation data includes the C4 validation set (Raffel et al., 2020)... The C4 training set is also incorporated within the RedPajama training data. |
| Hardware Specification | Yes | The experiments in (b) and (c) are conducted on the A100-SXM-80GB. [...] The models run on the A100-SXM. |
| Software Dependencies | No | The paper mentions software frameworks and models used (e.g., GPT-NeoX tokenizer, Megatron) but does not provide specific version numbers for key software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Training details. We used a standard pre-norm transformer... In MoE models, two experts are activated per token... using a balance loss... weight of 0.01... In UltraMem models, the auxiliary loss weight is α = 0.001 and margin τ = 0.15. The learning rate for values is ten times that of other parameters and decays linearly. For model structure and hyperparameter details, see Appendix E, and for large-scale training optimizations, see Appendices C and D. Appendix E, Table 4, provides: weight decay 0.1, β1 0.9, β2 0.95, LR (6e-4/2.5e-4/2e-4/1.2e-4), LR end ratio 0.1, LR schedule cosine, LR warmup ratio 0.01, dropout 0.1, batch size 2048, sequence length 2048, training steps 238418. Table 5 provides: Tucker rank r 2, multi-core scoring h 2, virtual memory expansion E 4, aux loss weight α 0.001, aux loss margin τ 0.15. |
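The schedule hyperparameters quoted in the Experiment Setup row (cosine LR with warmup ratio 0.01, end ratio 0.1, 238418 training steps, peak LRs 6e-4/2.5e-4/2e-4/1.2e-4) can be sketched as a small scheduler function. This is a minimal sketch assuming a linear warmup and a cosine decay to `END_RATIO * peak_lr`; the function name and exact warmup shape are assumptions, since the paper does not publish its scheduler code, and the mapping of peak LRs to model scales is not quoted here.

```python
import math

# Reported values from Appendix E, Table 4 of the paper.
TOTAL_STEPS = 238418
WARMUP_STEPS = int(0.01 * TOTAL_STEPS)  # LR warmup ratio 0.01
END_RATIO = 0.1                         # final LR = 0.1 * peak LR

def lr_at_step(step: int, peak_lr: float = 6e-4) -> float:
    """Cosine LR schedule with linear warmup (sketch, not the paper's code).

    Peak LRs reported for the different model scales are
    6e-4, 2.5e-4, 2e-4, and 1.2e-4; pass the relevant one as peak_lr.
    """
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / WARMUP_STEPS
    # Cosine decay from peak_lr down to END_RATIO * peak_lr.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (END_RATIO + (1.0 - END_RATIO) * cosine)
```

At step 0 the rate is 0, at the end of warmup it reaches the peak, and at the final step it has decayed to one tenth of the peak, matching the reported end ratio.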