APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
Authors: Xinyu Yang, Tianqi Chen, Beidi Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5× speedup by reducing prefilling time 28× for a 128K-length context. |
| Researcher Affiliation | Collaboration | Xinyu Yang (CMU), Tianqi Chen (CMU, NVIDIA), Beidi Chen (CMU) |
| Pseudocode | Yes | `def ape_attention(query, key, value, temperature, scale):`<br>`    # split key and value states into context and non-context parts`<br>`    key_context, key_other = key`<br>`    value_context, value_other = value`<br>`    logits_context, lse_context = flash_attn(query, key_context, value_context, temperature)`<br>`    logits_other, lse_other = flash_attn(query, key_other, value_other)`<br>`    lse_context = lse_context * scale`<br>`    attn_weights = [lse_context, lse_other]`<br>`    attn_weights = Softmax(attn_weights)`<br>`    value_states = [logits_context, logits_other]`<br>`    attn_output = attn_weights @ value_states` |
| Open Source Code | Yes | The code is available at https://github.com/Infini-AI-Lab/APE. |
| Open Datasets | Yes | Setup. APE is evaluated on five conversational QA tasks using ChatRAG Bench (Liu et al., 2024b). Our evaluation involves eight tasks on LongBench (Bai et al., 2023). Setup. We evaluate APE on three ICL tasks using the LM Evaluation Harness (Gao et al., 2024): GSM8K (8-shot) (Cobbe et al., 2021a), TriviaQA (5-shot) (Joshi et al., 2017), and MMLU (5-shot) (Hendrycks et al., 2020a). |
| Dataset Splits | No | The paper mentions using specific benchmarks (ChatRAG Bench, LongBench, LM Evaluation Harness, GSM8K, TriviaQA, MMLU, the LOFT benchmark), varying numbers of shots for in-context learning, and splitting contexts into chunks for RAG. However, it does not explicitly provide train/validation/test splits (e.g., percentages, absolute counts, or explicit reference to standard splits) within the paper text itself, relying implicitly on the standard setups of the cited benchmarks. |
| Hardware Specification | Yes | Our evaluation is conducted on an H100 GPU with batch sizes of 1 and 4. |
| Software Dependencies | No | The paper mentions employing vLLM (Kwon et al., 2023) as an inference engine but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | We tune hyperparameters on a validation set with greedy search. If no prefix is provided, we begin by adding two "\n" and increase the prefix length by 10, 20, and 40. S and T are searched in the range [0.1, 1.0] with a step size of 0.1. We use S * T instead of S as the scaling factor to simplify our search. The query and generation lengths are fixed at 256 tokens. |
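The pseudocode row above merges two attention calls via their log-sum-exp (LSE) statistics. A minimal NumPy sketch of that merge is below; `attn_with_lse` is a plain-attention stand-in for the paper's `flash_attn` (our own helper, not the released implementation), and with `scale=1.0` and `temperature=1.0` the merge exactly recovers attention over the concatenated keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_with_lse(q, k, v, temperature=1.0):
    """Single-head attention returning (output, log-sum-exp of logits).

    Stand-in for flash_attn: q is (n_q, d), k/v are (n_kv, d).
    """
    d = q.shape[-1]
    logits = (q @ k.T) / (np.sqrt(d) * temperature)
    m = logits.max(axis=-1, keepdims=True)
    lse = np.log(np.exp(logits - m).sum(axis=-1)) + m[:, 0]
    return softmax(logits) @ v, lse

def ape_attention(q, k_ctx, v_ctx, k_oth, v_oth, temperature, scale):
    """Attend to context and non-context tokens separately, then
    reweight the two partial outputs by their (scaled) LSEs."""
    out_ctx, lse_ctx = attn_with_lse(q, k_ctx, v_ctx, temperature)
    out_oth, lse_oth = attn_with_lse(q, k_oth, v_oth)
    lse_ctx = lse_ctx * scale  # rescale the context branch, as in the pseudocode
    w = softmax(np.stack([lse_ctx, lse_oth], axis=-1))  # (n_q, 2)
    return w[..., :1] * out_ctx + w[..., 1:] * out_oth
```

The equivalence at `scale=1, temperature=1` holds because `exp(lse)` is exactly each branch's softmax normalizer, so the softmax over the two LSEs reproduces the global normalization; temperature and scale then let APE re-balance the parallel-encoded context branch against the rest.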
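The Experiment Setup row describes tuning S and T over [0.1, 1.0] in 0.1 steps, searching the product S * T rather than S directly. The paper only says "greedy search"; the exhaustive sweep below is a simplifying assumption, and `score_fn` is a hypothetical validation callback returning a score to maximize.

```python
import numpy as np

def search_scale_temperature(score_fn):
    """Sweep temperature T and the product S*T over [0.1, 1.0]
    in 0.1 steps; return the best (S, T) and its validation score.

    score_fn(S, T) -> float is assumed to evaluate one setting
    on a held-out validation set.
    """
    grid = np.round(np.arange(0.1, 1.01, 0.1), 1)
    best, best_score = None, -np.inf
    for t in grid:
        for st in grid:          # search S*T directly, per the paper
            s = st / t           # recover S from the product
            val = score_fn(s, t)
            if val > best_score:
                best, best_score = (s, t), val
    return best, best_score
```

Searching S * T keeps both axes on the same bounded grid, which is presumably why the paper prefers it over searching S alone.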