ZETA: Leveraging $Z$-order Curves for Efficient Top-$k$ Attention

Authors: Qiuhao Zeng, Jierui Huang, Peng Lu, Gezheng Xu, Boxing Chen, Charles Ling, Boyu Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ZETA matches the performance of standard attention on the synthetic MULTI-QUERY ASSOCIATIVE RECALL task and outperforms attention and its variants on LONG RANGE ARENA and WIKITEXT-103 language modeling.
Researcher Affiliation | Collaboration | University of Western Ontario; Université de Montréal; Mila; Noah's Ark Lab; Vector Institute
Pseudocode | Yes | The pseudo-code in Algorithm 1 outlines the ZETA Top-k Attention mechanism, which combines Z-order curve projections with chunk-based sorting to efficiently identify and retrieve the top-k nearest neighbors while maintaining causal constraints.
Open Source Code | No | The paper states "Our implementation is based on Triton." and discusses its optimization, but provides no explicit statement of, or link to, a public release of the ZETA source code.
Open Datasets | Yes | We evaluate ZETA's performance on several aspects: ZETA's ability to solve the synthetic MULTI-QUERY ASSOCIATIVE RECALL task (Arora et al., 2024a), long sequence modeling ability on the LONG RANGE ARENA (LRA) benchmark, and auto-regressive language modeling on WIKITEXT-103.
Dataset Splits | Yes | For each model, we adopt the same hyperparameter settings provided by the official LRA benchmark (Tay et al., 2021) to ensure a fair comparison.
Hardware Specification | No | The paper mentions the use of "GPUs" and "Triton" for implementation and efficiency benchmarking, but does not specify any particular GPU models, CPU models, or other hardware configurations used for the experiments.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" and "Triton" but does not provide version numbers for these software components or any other libraries.
Experiment Setup | Yes | The ZETA model configuration generally involves setting the number of chunks to values such as 4, 8, 16, or 32 depending on the sequence length... The hidden dimension, d_V, is typically set to 256 or 512 with 8 attention heads when working with LRA datasets. However, for larger and more complex datasets such as WIKITEXT-103, the hidden dimension is increased to d_V = 768 with 12 attention heads... Additionally, the dimensions of keys and queries are kept significantly lower at d_K = d_Q = 3... In most of our experiments, we set k = 32...
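The pseudocode row above describes the core mechanism: project keys and queries into a low-dimensional space, map them onto a Z-order (Morton) curve, and use locality along the curve to retrieve top-k candidate neighbors. A minimal, hypothetical sketch of that idea follows; it is not the authors' Triton implementation, and the function names, window heuristic, and exact re-ranking step are illustrative assumptions.

```python
import bisect

def morton_code(coords, bits=10):
    """Interleave the bits of d integer coordinates into one Z-order (Morton) code.
    Nearby points in space tend to receive nearby codes, which is the locality
    property the top-k retrieval below relies on."""
    d = len(coords)
    code = 0
    for i in range(bits):
        for j, c in enumerate(coords):
            code |= ((c >> i) & 1) << (i * d + j)
    return code

def topk_candidates(query, keys, k, bits=10):
    """Approximate top-k nearest keys for a query via Z-order locality (sketch).
    keys: list of integer coordinate tuples (e.g., quantized low-dim projections).
    Returns indices of k candidate keys, nearest first."""
    # Sort all keys once by their Morton code (analogous to chunk-based sorting).
    coded = sorted((morton_code(kc, bits), idx) for idx, kc in enumerate(keys))
    codes = [c for c, _ in coded]
    # Locate the query on the curve and take a window of nearby codes.
    pos = bisect.bisect_left(codes, morton_code(query, bits))
    lo, hi = max(0, pos - k), min(len(coded), pos + k)
    window = coded[lo:hi]
    # Re-rank the candidate window by exact squared distance and keep k.
    window.sort(key=lambda ci: sum((a - b) ** 2 for a, b in zip(keys[ci[1]], query)))
    return [idx for _, idx in window[:k]]
```

For example, with keys `[(0,0,0), (1,1,1), (7,7,7), (2,2,2)]` and query `(1,1,0)`, the sketch returns indices `[1, 0]` for k = 2, since `(1,1,1)` and `(0,0,0)` are the closest keys. Causal masking (restricting candidates to earlier positions) is omitted here for brevity.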
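The experiment-setup row can be condensed into a configuration sketch. The field names below are illustrative assumptions, not taken from the authors' code; only the values come from the paper's reported settings.

```python
# Hypothetical configuration sketch of the reported ZETA hyperparameters.
# Field names are assumptions; values follow the paper's experiment setup.
LRA_CONFIG = {
    "num_chunks": 32,   # 4, 8, 16, or 32 depending on sequence length
    "d_V": 256,         # hidden dimension; 256 or 512 on LRA
    "num_heads": 8,
    "d_K": 3,           # keys/queries kept low-dimensional for the Z-order projection
    "d_Q": 3,
    "top_k": 32,        # k = 32 in most experiments
}

# WIKITEXT-103 uses a larger hidden dimension and more heads.
WIKITEXT103_CONFIG = {**LRA_CONFIG, "d_V": 768, "num_heads": 12}
```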