Ultra-Sparse Memory Network

Authors: Zihao Huang, Qiyang Min, Hongzhi Huang, Yutao Zeng, Defa Zhu, Ran Guo, Xun Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts. [...] 5 EXPERIMENTS
Researcher Affiliation | Industry | Seed-Foundation-Model Team, ByteDance
Pseudocode | No | The paper describes steps in regular paragraph text and provides flow diagrams (Figure 4, Figure 5) but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The text does not include an unambiguous statement about releasing code or a link to a source code repository for the methodology described in this paper.
Open Datasets | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens. RedPajama is a clean-room, fully open-source reproduction of the LLaMA (Touvron et al., 2023) dataset. Validation data includes the C4 validation set (Raffel et al., 2020), derived from the Common Crawl web corpus.
Dataset Splits | Yes | Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens... Validation data includes the C4 validation set (Raffel et al., 2020)... The C4 training set is also incorporated within the RedPajama training data.
Hardware Specification | Yes | The experiments in (b) and (c) are conducted on the A100-SXM-80GB. [...] The models run on the A100-SXM.
Software Dependencies | No | The paper mentions software frameworks and models used (e.g., the GPT-NeoX tokenizer, Megatron) but does not provide specific version numbers for key software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | Training details. We used a standard pre-norm transformer... In MoE models, two experts are activated per token... using a balance loss... weight of 0.01... In UltraMem models, the auxiliary loss weight is α = 0.001 and the margin is τ = 0.15. The learning rate for values is ten times that of other parameters and decays linearly. For model structure and hyperparameter details, see Appendix E; for large-scale training optimizations, see Appendices C and D. Appendix E, Table 4, provides: weight decay 0.1, β1 0.9, β2 0.95, LR 6e-4/2.5e-4/2e-4/1.2e-4, LR end ratio 0.1, cosine LR schedule, LR warmup ratio 0.01, dropout 0.1, batch size 2048, sequence length 2048, training steps 238418. Table 5 provides: Tucker rank r 2, multi-core scoring h 2, virtual memory expansion E 4, aux loss weight α 0.001, aux loss margin τ 0.15.
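The reported schedule (cosine decay with warmup for most parameters, a 10x linearly decayed learning rate for value parameters) can be sketched as plain functions. This is a minimal sketch assuming a standard warmup-then-cosine formulation; the function names `cosine_lr` and `value_lr` are illustrative and not from the paper, and only the hyperparameter values (warmup ratio 0.01, end ratio 0.1, 238418 steps, base LR 6e-4) come from Table 4.

```python
import math

# Hyperparameters reported in Appendix E, Table 4.
TOTAL_STEPS = 238418
WARMUP_RATIO = 0.01   # fraction of steps used for linear warmup
LR_END_RATIO = 0.1    # final LR as a fraction of the base LR

def cosine_lr(step, base_lr=6e-4, total_steps=TOTAL_STEPS,
              warmup_ratio=WARMUP_RATIO, end_ratio=LR_END_RATIO):
    """Linear warmup, then cosine decay down to end_ratio * base_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = base_lr * end_ratio
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def value_lr(step, base_lr=6e-4, total_steps=TOTAL_STEPS):
    """Value parameters: ten times the base LR, decayed linearly to zero
    (a sketch of the rule stated in the training details)."""
    return 10 * base_lr * (1 - step / total_steps)
```

In a PyTorch-style setup this would correspond to two optimizer parameter groups, one for the memory values and one for everything else, each driven by its own schedule.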