Deliberation in Latent Space via Differentiable Cache Augmentation

Authors: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently improves performance across a range of reasoning-intensive tasks.
Researcher Affiliation Industry Google DeepMind. Correspondence to: Luyang Liu <EMAIL>, Arthur Szlam <EMAIL>.
Pseudocode No The paper describes methods and processes but does not include any explicitly labeled pseudocode or algorithm blocks. The figures illustrate architectural concepts rather than step-by-step algorithms.
Open Source Code No The paper does not contain any explicit statements about releasing code, nor does it provide links to any repositories. Phrases like "We release our code" or similar are absent.
Open Datasets Yes We evaluate our method using Gemma-2 (Team Gemma et al., 2024) models pretrained on a diverse dataset mixture. Our augmented Gemma-2 models... are trained on the same 2 trillion token, primarily English dataset used for Gemma-2 pretraining (Team Gemma et al., 2024). We evaluated cache augmentation on a range of public benchmarks spanning natural language understanding and reasoning tasks (Table 2). In this setting, we only call the coprocessor once, at the end of the prompt. Our method consistently improves performance compared to the baseline frozen Gemma-2 2B model, with particularly substantial gains on reasoning-intensive benchmarks. Several tasks, including MMLU, GSM8K, TriviaQA, NQ, and MATH, exhibit a strong correlation between the number of latent embeddings and performance improvement.
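The setup the excerpt describes — a frozen decoder whose kv-cache is extended with latent embeddings emitted by a trained coprocessor, called once at the end of the prompt — can be sketched roughly as follows. This is a minimal illustration, not the paper's architecture: the module shapes, the mean-pooled cache summary, and the MLP coprocessor are all our assumptions (the actual coprocessor is a full transformer operating on the kv-cache).

```python
import torch
import torch.nn as nn

class Coprocessor(nn.Module):
    """Illustrative coprocessor: maps a summary of a frozen LLM's kv-cache
    to `num_latents` embeddings that are appended to that cache.
    (Hypothetical stand-in -- the paper uses a full transformer here.)"""
    def __init__(self, d_model: int, num_latents: int):
        super().__init__()
        self.num_latents = num_latents
        self.d_model = d_model
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, num_latents * d_model),
        )

    def forward(self, kv_cache: torch.Tensor) -> torch.Tensor:
        # kv_cache: [batch, seq_len, d_model]; mean-pooling is a crude
        # stand-in for attending over the full cache.
        pooled = kv_cache.mean(dim=1)               # [batch, d_model]
        latents = self.mlp(pooled)                  # [batch, k * d_model]
        return latents.view(-1, self.num_latents, self.d_model)

# Augment the cache once, at the end of the prompt, then decode as usual.
d_model, num_latents = 64, 8
coproc = Coprocessor(d_model, num_latents)
cache = torch.randn(2, 32, d_model)                 # frozen LLM's cache
augmented = torch.cat([cache, coproc(cache)], dim=1)  # [2, 32 + 8, 64]
```

The decoder then attends over `augmented` exactly as it would over an ordinary cache, which is why no change to the base LLM is needed and why the number of latent embeddings can be varied at evaluation time.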
Dataset Splits No We evaluate this using a proprietary validation dataset (same as the one used in Gemma (Team Gemma et al., 2024))... We used LoRA finetuning (with a rank of 128) on both the baseline model and our augmented model. For the baseline, LoRA was applied directly to the base LLM, while for our augmented model, LoRA was applied specifically to the coprocessor, leaving the base LLM frozen. This improvement is likely attributable to the strong regularization imposed by keeping the base LLM frozen during coprocessor training. This freezing prevents overfitting to the relatively small downstream datasets, allowing the coprocessor to effectively learn task-specific reasoning patterns without disrupting the general knowledge encoded in the pretrained LLM.
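The freezing pattern described here — pretrained weights held fixed while only low-rank LoRA factors on the coprocessor receive gradients — can be sketched as below. The tiny linear "models" and the rank value are placeholders for illustration; the paper reports rank 128 on the coprocessor.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay fixed
        # B starts at zero so the update is initially a no-op.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

# Hypothetical stand-ins for the real models.
base_llm = nn.Linear(16, 16)
for p in base_llm.parameters():
    p.requires_grad = False                  # base LLM is never finetuned

coprocessor = LoRALinear(nn.Linear(16, 16), rank=4)

# Only the coprocessor's LoRA factors would receive gradients.
trainable = sorted(n for n, p in coprocessor.named_parameters()
                   if p.requires_grad)
```

Because only `A` and `B` are trainable, the optimizer touches a small fraction of the parameters, which is the regularization effect the excerpt credits for avoiding overfitting on small downstream datasets.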
Hardware Specification No The paper mentions using "Gemma-2 2B model" and "Gemma-2 9B model" as the base LLMs, but it does not specify any hardware (e.g., GPU, CPU models, or cloud computing platforms) used for training or inference.
Software Dependencies No The paper does not provide specific version numbers for any software, libraries, or frameworks used in their experiments. It describes the methodology and models but omits these details.
Experiment Setup Yes We trained the model for 100,000 steps using a batch size of 1024, packed sequences of length 2048, 16 ahead tokens (NA), and 128 randomly sampled augmentation positions (traces) for all training experiments. Importantly, no task-specific training is performed for any of the experiments; all training is done on the pretraining dataset.
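The reported hyperparameters imply a back-of-the-envelope token budget for coprocessor training; the arithmetic below is our estimate (ignoring packing overhead), not a figure from the paper.

```python
# Hyperparameters as reported in the row above.
config = {
    "train_steps": 100_000,
    "batch_size": 1024,
    "sequence_length": 2048,        # packed sequences
    "ahead_tokens": 16,
    "augmentation_positions": 128,  # randomly sampled per sequence
}

# Rough token budget: steps x batch x sequence length.
tokens_seen = (config["train_steps"]
               * config["batch_size"]
               * config["sequence_length"])
print(f"{tokens_seen / 1e12:.2f}T tokens")  # ~0.21T of the 2T pretraining mixture
```

So coprocessor training consumes on the order of a tenth of the 2-trillion-token Gemma-2 pretraining mixture, consistent with the claim that no task-specific data is needed.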