Memory Mosaics

Authors: Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Leon Bottou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 7 reports on medium-scale language modeling experiments. Figure 7 shows the training and validation curves of both transformers and Memory Mosaics of different depths trained on BABISTORIES. Figure 8 shows the per-token average loss as a function of the position of the generated token in the input window. Figure 9 compares Memory Mosaics on REGBENCH with the results previously reported by Akyürek et al.
Researcher Affiliation | Collaboration | Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou. Affiliations: FAIR at Meta, Carnegie Mellon University, New York University.
Pseudocode | No | The paper describes the architecture and mechanisms using equations (e.g., Equations 2, 3, 4, 5, 6, 7) and architectural diagrams (Figures 1, 3, 12), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We share the BABISTORIES dataset and Memory Mosaics source code at https://github.com/facebookresearch/MemoryMosaics.
Open Datasets | Yes | We share the BABISTORIES dataset and Memory Mosaics source code at https://github.com/facebookresearch/MemoryMosaics. The TINYSTORIES dataset (Eldan & Li, 2023) is composed of stories written in a simple language...
Dataset Splits | Yes | Table 1 (BABISTORIES statistics): train — 2.2M stories, 474,704,907 tokens (GPT2 tokenizer), 888 chars per story on average; valid — 2.2k stories, 4,749,107 tokens, 889 chars per story on average.
Hardware Specification | Yes | Models were trained on 64 NVIDIA V100 GPUs over 80k epochs. From conception to finalization of this paper we trained about 200 models. To create the BABISTORIES dataset via Mistral, we ran with 128 NVIDIA V100 GPUs for 3 days. The supporting machines contain Intel(R) Xeon(R) Gold 6230 CPUs. The 3 moons result took negligible resources and was trained on Apple M1 laptops.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer and the GPT2 tokenizer, but it does not specify version numbers for any key software components such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | Table 2 showcases the hyper-parameter search process for the GPT2 transformer baseline on the BABISTORIES dataset, where we use the AdamW optimizer (Loshchilov & Hutter, 2017), batch size 512, context size 512, and a cosine learning rate scheduler with minimum learning rate 1e-4 for all training. ... Both architectures, shown side-by-side in Figure 6, use the same GPT2 tokenizer, the same embedding dimension (d = 768), and the same number of heads (Nh = Nc = Np = 12).
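The quoted setup can be sketched as a small configuration plus a cosine learning rate schedule. This is a minimal illustrative sketch, not the authors' code: the peak learning rate (`max_lr`) and the step count are assumptions for demonstration (the paper quotes only the 1e-4 minimum learning rate and an 80k training duration; Table 2 of the paper covers the actual search).

```python
import math

# Hyperparameters quoted in the report; "AdamW" and "cosine scheduler"
# name the optimizer and schedule used for all training runs.
config = {
    "optimizer": "AdamW",
    "batch_size": 512,
    "context_size": 512,
    "d_model": 768,      # embedding dimension
    "n_heads": 12,       # Nh = Nc = Np = 12
    "min_lr": 1e-4,      # quoted minimum learning rate
}

max_lr = 6e-4            # ASSUMPTION: peak LR chosen for illustration only
min_lr = config["min_lr"]
total_steps = 80_000     # the paper reports an 80k training duration

def cosine_lr(step: int) -> float:
    """Cosine decay from max_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at `max_lr`, decays along a half cosine, and flattens at the quoted 1e-4 floor once `total_steps` is reached.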