An Evolved Universal Transformer Memory
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these findings across 36 different tasks from LongBench (Bai et al., 2023), InfiniteBench (Zhang et al., 2024a), and ChouBun, a new Japanese benchmark designed to assess long-context capabilities beyond the common English and Chinese. These results mark a clear contrast with the aforementioned hand-designed strategies that appear to inevitably trade off efficiency for performance, in line with their stated purpose. |
| Researcher Affiliation | Industry | Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang (Sakana AI, Japan) |
| Pseudocode | Yes | Algorithm 1 (NAMMs), Algorithm 2 (NAMMs) |
| Open Source Code | Yes | Our source code is available at https://github.com/SakanaAI/evo-memory. |
| Open Datasets | Yes | We validate these findings across 36 different tasks from LongBench (Bai et al., 2023), InfiniteBench (Zhang et al., 2024a), and ChouBun, a new Japanese benchmark designed to assess long-context capabilities beyond the common English and Chinese. We provide additional benchmark statistics, details about task composition, together with evaluation metrics for a wider range of popular LLMs in Appendix B.1. In Table 6, we provide results zero-shot transferring to the computer vision domain, evaluating NAMMs with a LLaVA-NeXT-Video 7B model (Zhang et al., 2024b) on LongVideoBench (Wu et al., 2024) and Multi-Task Long Video Understanding (MLVU) (Zhou et al., 2024). In Table 7, we provide our zero-shot transfer results for the offline reinforcement learning setting, where we apply NAMMs atop a decision transformer (Chen et al., 2021b) using the open-sourced models from Beeching & Simonini (2022) pre-trained on the canonical continuous-control tasks from D4RL (Fu et al., 2020). |
| Dataset Splits | Yes | Our NAMM yields concrete improvements to the Llama 3 8B transformer both when considering the full set or exclusively the held-out set of test tasks that were not used for evolution, with improvements of 11% and 7% respectively. We choose three tasks from different LongBench categories across both English and Chinese where the Llama 3 base model seems to particularly struggle: PassageRetrieval-en, DuReader, and NarrativeQA; optimizing the normalized exact match, ROUGE-L, and F1 metrics, respectively. We focus on the most popular subset of this benchmark, involving continuous-control tasks with three different agents: Hopper, HalfCheetah, and Walker2d, evaluating the agent after pretraining on Expert, Medium, and Medium-Replay data. |
| Hardware Specification | Yes | For our main experimental setup, we used rented cloud instances with Nvidia H100 GPUs, Intel Xeon Platinum 8481C CPUs, and 1932GB of RAM. |
| Software Dependencies | No | The paper mentions several models such as 'Llama 3 8B model (Dubey et al., 2024)', 'Llama 3 70B model', 'LLaVA-NeXT-Video 7B model (Zhang et al., 2024b)', 'Mistral 7B v0.3', and 'GPT-4 model (Achiam et al., 2023)'. It also references the 'CMA-ES optimization algorithm (Hansen, 2006)' and 'FlashAttention (Dao et al., 2022)'. However, it does not provide specific version numbers for general programming languages or libraries (e.g., Python, PyTorch, CUDA) that would be needed for replication. |
| Experiment Setup | Yes | We provide the main hyper-parameters in Table 9 and refer to either the work by Hansen (2006) or our shared code for the full implementation details. Table 9 lists: spectrogram window size n_w = 32; spectrogram window stride s_w = 16; spectrogram window type Hann; spectrogram EMA reduction coefficient γ = 0.9916; positional features 8; NAMMs execution delay 512; NAMMs non-linearity ReLU; evolution algorithm CMA-ES; elite ratio 0.5; mean coefficient c_m = 1; initial step size σ = 0.65; samples batch size per-task 64; population size 32. |
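To make the Table 9 hyper-parameters concrete, the sketch below shows one plausible reading of the spectrogram feature pipeline: a short-time Fourier transform over a token's attention history with a Hann window (size 32, stride 16), followed by an exponential-moving-average reduction with γ = 0.9916. The function names and the exact feature layout are our assumptions, not the paper's implementation; only the listed hyper-parameter values come from Table 9.

```python
import numpy as np

# Hyper-parameter values reported in Table 9; the pipeline structure
# around them is an illustrative assumption.
WINDOW_SIZE = 32    # spectrogram window size n_w
STRIDE = 16         # spectrogram window stride s_w
EMA_GAMMA = 0.9916  # spectrogram EMA reduction coefficient gamma


def attention_spectrogram(attn_values: np.ndarray) -> np.ndarray:
    """STFT magnitudes of a 1-D attention history with a Hann window.

    `attn_values` holds the attention scores one KV-cache token has
    received over time; returns one magnitude vector per window.
    """
    window = np.hanning(WINDOW_SIZE)
    frames = []
    for start in range(0, len(attn_values) - WINDOW_SIZE + 1, STRIDE):
        frame = attn_values[start:start + WINDOW_SIZE] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)


def ema_reduce(spectrogram: np.ndarray) -> np.ndarray:
    """Collapse the time axis with an exponential moving average,
    yielding a single feature vector per token."""
    feat = spectrogram[0]
    for frame in spectrogram[1:]:
        feat = EMA_GAMMA * feat + (1.0 - EMA_GAMMA) * frame
    return feat


# Example: a 64-step attention history yields 3 overlapping windows,
# each reduced to 17 rfft magnitude bins, then EMA-collapsed to one
# 17-dimensional feature vector.
history = np.ones(64)
spec = attention_spectrogram(history)
features = ema_reduce(spec)
```

With these shapes, the EMA stays heavily weighted toward the earliest window (γ close to 1), which matches the intuition of a slowly updating summary of each token's attention behavior.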