Memory Mosaics

Authors: Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Leon Bottou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 7 reports on medium-scale language modeling experiments. Figure 7 shows the training and validation curves of both transformers and Memory Mosaics of different depths trained on BABISTORIES. Figure 8 shows the per-token average loss as a function of the position of the generated token in the input window. Figure 9 compares Memory Mosaics on REGBENCH with the results previously reported by Akyürek et al.
Researcher Affiliation | Collaboration | Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou. Affiliations: FAIR at Meta, Carnegie Mellon University, New York University.
Pseudocode | No | The paper describes the architecture and mechanisms using equations (e.g., Equations 2, 3, 4, 5, 6, 7) and architectural diagrams (Figures 1, 3, 12), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We share the BABISTORIES dataset and Memory Mosaics source code at https://github.com/facebookresearch/MemoryMosaics.
Open Datasets | Yes | We share the BABISTORIES dataset and Memory Mosaics source code at https://github.com/facebookresearch/MemoryMosaics. The TINYSTORIES dataset (Eldan & Li, 2023) is composed of stories written in a simple language...
Dataset Splits | Yes | Table 1 (BABISTORIES statistics): train — 2.2M stories, 474,704,907 tokens (GPT2 tokenizer), 888 chars per story on average; valid — 2.2k stories, 4,749,107 tokens, 889 chars per story on average.
Hardware Specification | Yes | Models were trained on 64 NVIDIA V100 GPUs over 80k epochs. From conception to finalization of this paper we trained about 200 models. To create the BABISTORIES dataset via Mistral, we ran with 128 NVIDIA V100 GPUs for 3 days. The supporting machines contain Intel(R) Xeon(R) Gold 6230 CPUs. The 3 moons result took negligible resources and was trained on Apple M1 laptops.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer and the GPT2 tokenizer, but it does not specify version numbers for any key software components such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | Table 2 showcases the hyper-parameter search process for the GPT2 transformer baseline on the BABISTORIES dataset, where we use the AdamW optimizer (Loshchilov & Hutter, 2017), batch size 512, context size 512, and a cosine learning rate scheduler with minimum learning rate 1e-4 for all training. ... Both architectures, shown side-by-side in Figure 6, use the same GPT2 tokenizer, the same embedding dimension (d = 768), and the same number of heads (Nh = Nc = Np = 12).
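The quoted setup can be sketched as a small configuration plus a cosine learning rate schedule. This is a minimal illustrative sketch, not the authors' code: the peak learning rate (`max_lr`) and the step count are assumptions for demonstration (the paper quotes only the 1e-4 minimum learning rate and an 80k training duration; Table 2 of the paper covers the actual search).

```python
import math

# Hyperparameters quoted in the report; "AdamW" and "cosine scheduler"
# name the optimizer and schedule used for all training runs.
config = {
    "optimizer": "AdamW",
    "batch_size": 512,
    "context_size": 512,
    "d_model": 768,      # embedding dimension
    "n_heads": 12,       # Nh = Nc = Np = 12
    "min_lr": 1e-4,      # quoted minimum learning rate
}

max_lr = 6e-4            # ASSUMPTION: peak LR chosen for illustration only
min_lr = config["min_lr"]
total_steps = 80_000     # the paper reports an 80k training duration

def cosine_lr(step: int) -> float:
    """Cosine decay from max_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at `max_lr`, decays along a half cosine, and flattens at the quoted 1e-4 floor once `total_steps` is reached.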