An Evolved Universal Transformer Memory
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these findings across 36 different tasks from LongBench (Bai et al., 2023), InfiniteBench (Zhang et al., 2024a), and ChouBun, a new Japanese benchmark designed to assess long-context capabilities beyond the common English and Chinese. These results mark a clear contrast with the aforementioned hand-designed strategies that appear to inevitably trade off efficiency for performance, in line with their stated purpose. |
| Researcher Affiliation | Industry | Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang (Sakana AI, Japan) |
| Pseudocode | Yes | Algorithm 1 (NAMMs), Algorithm 2 (NAMMs) |
| Open Source Code | Yes | Our source code is available at https://github.com/SakanaAI/evo-memory. |
| Open Datasets | Yes | We validate these findings across 36 different tasks from LongBench (Bai et al., 2023), InfiniteBench (Zhang et al., 2024a), and ChouBun, a new Japanese benchmark designed to assess long-context capabilities beyond the common English and Chinese. We provide additional benchmark statistics, details about task composition, together with evaluation metrics for a wider range of popular LLMs in Appendix B.1. In Table 6, we provide results zero-shot transferring to the computer vision domain, evaluating NAMMs with a LLaVA-NeXT-Video 7B model (Zhang et al., 2024b) on LongVideoBench (Wu et al., 2024) and Multi-Task Long Video Understanding (MLVU) (Zhou et al., 2024). In Table 7, we provide our zero-shot transfer results for the offline reinforcement learning setting, where we apply NAMMs atop a decision transformer (Chen et al., 2021b) using the open-sourced models from Beeching & Simonini (2022) pre-trained on the canonical continuous-control tasks from D4RL (Fu et al., 2020). |
| Dataset Splits | Yes | Our NAMM yields concrete improvements to the Llama 3 8B transformer both when considering the full set or exclusively the held-out set of test tasks that were not used for evolution, with improvements of 11% and 7% respectively. We choose three tasks from different LongBench categories across both English and Chinese where the Llama 3 base model seems to particularly struggle: PassageRetrieval-en, DuReader, and NarrativeQA; optimizing the normalized exact match, ROUGE-L, and F1 metrics, respectively. We focus on the most popular subset of this benchmark, involving continuous-control tasks with three different agents: Hopper, HalfCheetah, and Walker2d, evaluating the agent after pretraining on Expert, Medium, and Medium-Replay data. |
| Hardware Specification | Yes | For our main experimental setup, we used rented cloud instances with Nvidia H100 GPUs, Intel Xeon Platinum 8481C CPUs, and 1932GB of RAM. |
| Software Dependencies | No | The paper mentions several models such as 'Llama 3 8B model (Dubey et al., 2024)', 'Llama 3 70B model', 'LLaVA-NeXT-Video 7B model (Zhang et al., 2024b)', 'Mistral 7B v0.3', and 'GPT-4 model (Achiam et al., 2023)'. It also references the 'CMA-ES optimization algorithm (Hansen, 2006)' and 'FlashAttention (Dao et al., 2022)'. However, it does not provide specific version numbers for general programming languages or libraries (e.g., Python, PyTorch, CUDA) that would be needed for replication. |
| Experiment Setup | Yes | We provide the main hyper-parameters in Table 9 and refer to either the work by Hansen (2006) or our shared code for the full implementation details. Table 9 lists: spectrogram window size n_w = 32; spectrogram window stride s_w = 16; spectrogram window type Hann; spectrogram EMA reduction coefficient γ = 0.9916; positional features 8; NAMMs execution delay 512; NAMMs non-linearity ReLU; evolution algorithm CMA-ES; elite ratio 0.5; mean coefficient c_m = 1; initial step size σ = 0.65; samples batch size per-task 64; population size 32. |
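To make the Table 9 hyper-parameters concrete, the sketch below shows one plausible reading of the spectrogram feature pipeline: a short-time Fourier transform over a token's attention history with a Hann window (size 32, stride 16), followed by an exponential-moving-average reduction with γ = 0.9916. The function names and the exact feature layout are our assumptions, not the paper's implementation; only the listed hyper-parameter values come from Table 9.

```python
import numpy as np

# Hyper-parameter values reported in Table 9; the pipeline structure
# around them is an illustrative assumption.
WINDOW_SIZE = 32    # spectrogram window size n_w
STRIDE = 16         # spectrogram window stride s_w
EMA_GAMMA = 0.9916  # spectrogram EMA reduction coefficient gamma


def attention_spectrogram(attn_values: np.ndarray) -> np.ndarray:
    """STFT magnitudes of a 1-D attention history with a Hann window.

    `attn_values` holds the attention scores one KV-cache token has
    received over time; returns one magnitude vector per window.
    """
    window = np.hanning(WINDOW_SIZE)
    frames = []
    for start in range(0, len(attn_values) - WINDOW_SIZE + 1, STRIDE):
        frame = attn_values[start:start + WINDOW_SIZE] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)


def ema_reduce(spectrogram: np.ndarray) -> np.ndarray:
    """Collapse the time axis with an exponential moving average,
    yielding a single feature vector per token."""
    feat = spectrogram[0]
    for frame in spectrogram[1:]:
        feat = EMA_GAMMA * feat + (1.0 - EMA_GAMMA) * frame
    return feat


# Example: a 64-step attention history yields 3 overlapping windows,
# each reduced to 17 rfft magnitude bins, then EMA-collapsed to one
# 17-dimensional feature vector.
history = np.ones(64)
spec = attention_spectrogram(history)
features = ema_reduce(spec)
```

With these shapes, the EMA stays heavily weighted toward the earliest window (γ close to 1), which matches the intuition of a slowly updating summary of each token's attention behavior.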