APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
Authors: Xinyu Yang, Tianqi Chen, Beidi Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5× speedup by reducing prefilling time 28× for a 128K-length context. |
| Researcher Affiliation | Collaboration | Xinyu Yang (CMU), Tianqi Chen (CMU, NVIDIA), Beidi Chen (CMU) |
| Pseudocode | Yes | `def ape_attention(query, key, value, temperature, scale):`<br>`    # split key and value states into context and non-context parts`<br>`    key_context, key_other = key`<br>`    value_context, value_other = value`<br>`    logits_context, lse_context = flash_attn(query, key_context, value_context, temperature)`<br>`    logits_other, lse_other = flash_attn(query, key_other, value_other)`<br>`    lse_context = lse_context * scale`<br>`    attn_weights = [lse_context, lse_other]`<br>`    attn_weights = Softmax(attn_weights)`<br>`    value_states = [logits_context, logits_other]`<br>`    attn_output = attn_weights @ value_states` |
| Open Source Code | Yes | The code is available at https://github.com/Infini-AI-Lab/APE. |
| Open Datasets | Yes | Setup. APE is evaluated on five conversational QA tasks using ChatRAG Bench (Liu et al., 2024b). Our evaluation involves eight tasks on LongBench (Bai et al., 2023). Setup. We evaluate APE on three ICL tasks using the LM Evaluation Harness (Gao et al., 2024): GSM8K (8-shot) (Cobbe et al., 2021a), TriviaQA (5-shot) (Joshi et al., 2017), and MMLU (5-shot) (Hendrycks et al., 2020a). |
| Dataset Splits | No | The paper mentions using specific benchmarks (ChatRAG Bench, LongBench, LM Evaluation Harness, GSM8K, TriviaQA, MMLU, the LOFT benchmark), varying numbers of shots for in-context learning, and splitting contexts into chunks for RAG. However, it does not explicitly provide train/validation/test splits (e.g., percentages, absolute counts, or explicit reference to standard splits) within the paper text itself, relying implicitly on the standard setups of the cited benchmarks. |
| Hardware Specification | Yes | Our evaluation is conducted on an H100 GPU with batch sizes of 1 and 4. |
| Software Dependencies | No | The paper mentions employing vLLM (Kwon et al., 2023) as an inference engine but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | We tune hyperparameters on a validation set with greedy search. If no prefix is provided, we begin by adding two "\n" and increase the prefix length by 10, 20, and 40. S and T are searched in the range [0.1, 1.0] with a step size of 0.1. We use S * T instead of S as the scaling factor to simplify our search. The query and generation lengths are fixed at 256 tokens. |
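The pseudocode row above merges two attention calls via their log-sum-exp (LSE) statistics. A minimal NumPy sketch of that merge is below; `attn_with_lse` is a plain-attention stand-in for the paper's `flash_attn` (our own helper, not the released implementation), and with `scale=1.0` and `temperature=1.0` the merge exactly recovers attention over the concatenated keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_with_lse(q, k, v, temperature=1.0):
    """Single-head attention returning (output, log-sum-exp of logits).

    Stand-in for flash_attn: q is (n_q, d), k/v are (n_kv, d).
    """
    d = q.shape[-1]
    logits = (q @ k.T) / (np.sqrt(d) * temperature)
    m = logits.max(axis=-1, keepdims=True)
    lse = np.log(np.exp(logits - m).sum(axis=-1)) + m[:, 0]
    return softmax(logits) @ v, lse

def ape_attention(q, k_ctx, v_ctx, k_oth, v_oth, temperature, scale):
    """Attend to context and non-context tokens separately, then
    reweight the two partial outputs by their (scaled) LSEs."""
    out_ctx, lse_ctx = attn_with_lse(q, k_ctx, v_ctx, temperature)
    out_oth, lse_oth = attn_with_lse(q, k_oth, v_oth)
    lse_ctx = lse_ctx * scale  # rescale the context branch, as in the pseudocode
    w = softmax(np.stack([lse_ctx, lse_oth], axis=-1))  # (n_q, 2)
    return w[..., :1] * out_ctx + w[..., 1:] * out_oth
```

The equivalence at `scale=1, temperature=1` holds because `exp(lse)` is exactly each branch's softmax normalizer, so the softmax over the two LSEs reproduces the global normalization; temperature and scale then let APE re-balance the parallel-encoded context branch against the rest.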
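The Experiment Setup row describes tuning S and T over [0.1, 1.0] in 0.1 steps, searching the product S * T rather than S directly. The paper only says "greedy search"; the exhaustive sweep below is a simplifying assumption, and `score_fn` is a hypothetical validation callback returning a score to maximize.

```python
import numpy as np

def search_scale_temperature(score_fn):
    """Sweep temperature T and the product S*T over [0.1, 1.0]
    in 0.1 steps; return the best (S, T) and its validation score.

    score_fn(S, T) -> float is assumed to evaluate one setting
    on a held-out validation set.
    """
    grid = np.round(np.arange(0.1, 1.01, 0.1), 1)
    best, best_score = None, -np.inf
    for t in grid:
        for st in grid:          # search S*T directly, per the paper
            s = st / t           # recover S from the product
            val = score_fn(s, t)
            if val > best_score:
                best, best_score = (s, t), val
    return best, best_score
```

Searching S * T keeps both axes on the same bounded grid, which is presumably why the paper prefers it over searching S alone.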