APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

Authors: Xinyu Yang, Tianqi Chen, Beidi Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% of sequential encoding performance on the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE achieves an end-to-end 4.5× speedup by reducing prefilling time 28× for a 128K-length context.
Researcher Affiliation | Collaboration | Xinyu Yang (CMU), Tianqi Chen (CMU, NVIDIA), Beidi Chen (CMU)
Pseudocode | Yes |

    def ape_attention(query, key, value, temperature, scale):
        # split key and value states into context and non-context parts
        key_context, key_other = key
        value_context, value_other = value
        # attend to the parallel-encoded context with an adjusted temperature
        logits_context, lse_context = flash_attn(query, key_context, value_context, temperature)
        # attend to the remaining (non-context) tokens as usual
        logits_other, lse_other = flash_attn(query, key_other, value_other)
        # rescale the context log-sum-exp by the scaling factor
        lse_context = lse_context * scale
        # fuse the two branches with a softmax over their log-sum-exps
        attn_weights = softmax([lse_context, lse_other])
        value_states = [logits_context, logits_other]
        attn_output = attn_weights @ value_states
        return attn_output
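The branch fusion in the pseudocode above (weighting each branch's output by a softmax over its log-sum-exp) can be checked with a small self-contained NumPy sketch. `attn_branch` and the toy shapes below are illustrative stand-ins, not the paper's implementation; with scale and temperature equal to 1, the fused result recovers ordinary attention over the concatenated keys exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_branch(q, k, v, temperature=1.0):
    # single-head attention over one branch; returns the normalized output
    # and the log-sum-exp of the (scaled) scores, as flash attention does
    scores = (q @ k.T) / (temperature * np.sqrt(q.shape[-1]))
    lse = np.log(np.exp(scores).sum(axis=-1))   # (num_queries,)
    out = softmax(scores) @ v                   # (num_queries, d)
    return out, lse

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((2, d))
k_ctx, v_ctx = rng.standard_normal((5, d)), rng.standard_normal((5, d))
k_oth, v_oth = rng.standard_normal((3, d)), rng.standard_normal((3, d))

out_ctx, lse_ctx = attn_branch(q, k_ctx, v_ctx)
out_oth, lse_oth = attn_branch(q, k_oth, v_oth)

# fuse the two branches: weights are a softmax over the (scaled) LSEs
scale = 1.0  # APE's searched scaling factor; 1.0 recovers plain attention
w = softmax(np.stack([lse_ctx * scale, lse_oth], axis=-1))
fused = w[:, :1] * out_ctx + w[:, 1:] * out_oth

# sanity check: equals attention over the concatenated keys and values
full, _ = attn_branch(q, np.concatenate([k_ctx, k_oth]),
                      np.concatenate([v_ctx, v_oth]))
assert np.allclose(fused, full)
```

Setting `temperature` below 1 or `scale` away from 1 then deviates from plain attention, which is exactly the adaptive adjustment the pseudocode exposes.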
Open Source Code | Yes | The code is available at https://github.com/Infini-AI-Lab/APE.
Open Datasets | Yes | Setup. APE is evaluated on five conversational QA tasks using ChatRAG-Bench (Liu et al., 2024b). Our evaluation involves eight tasks on LongBench (Bai et al., 2023). Setup. We evaluate APE on three ICL tasks using the LM Evaluation Harness (Gao et al., 2024): GSM8K (8-shot) (Cobbe et al., 2021a), TriviaQA (5-shot) (Joshi et al., 2017), and MMLU (5-shot) (Hendrycks et al., 2020a).
Dataset Splits | No | The paper evaluates on standard benchmarks (ChatRAG-Bench, LongBench, the LM Evaluation Harness with GSM8K, TriviaQA, and MMLU, and the LOFT benchmark), varies the number of in-context shots, and splits contexts into chunks for RAG. However, it does not explicitly state training/test/validation splits (percentages, absolute counts, or explicit references to standard splits) in the paper text itself, relying implicitly on the standard setups of the cited benchmarks.
Hardware Specification | Yes | Our evaluation is conducted on an H100 GPU with batch sizes of 1 and 4.
Software Dependencies | No | The paper mentions employing vLLM (Kwon et al., 2023) as an inference engine but does not specify its version number or any other software dependencies with their versions.
Experiment Setup | Yes | We tune hyperparameters on a validation set with greedy search. If no prefix is provided, we begin by adding two "\n" tokens and increase the prefix length by 10, 20, and 40. S and T are searched over the range [0.1, 1.0] with a step size of 0.1. We use S * T instead of S as the scaling factor to simplify our search. The query and generation lengths are fixed at 256 tokens.
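The S/T search in the setup above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper uses greedy search, while this sketch brute-forces the same 0.1-step grid, and `evaluate` is a hypothetical validation callback supplied by the caller:

```python
import itertools

def search_scaling(evaluate):
    """Search the scale S and temperature T on a 0.1-step grid over [0.1, 1.0].

    `evaluate(s, t)` is a hypothetical callback that runs the model with the
    combined scaling factor S * T and returns a validation score (higher is
    better); the pair with the best score is returned.
    """
    grid = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
    best_score, best_s, best_t = max(
        (evaluate(s, t), s, t) for s, t in itertools.product(grid, grid)
    )
    return best_s, best_t

# toy usage with a stand-in objective that peaks at S=0.7, T=0.5
s, t = search_scaling(lambda s, t: -((s - 0.7) ** 2 + (t - 0.5) ** 2))
```

Searching S * T jointly rather than S alone, as the setup notes, keeps the grid two-dimensional while directly optimizing the quantity the attention fusion actually uses.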