Deliberation in Latent Space via Differentiable Cache Augmentation

Authors: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently improves performance across a range of reasoning-intensive tasks.
Researcher Affiliation Industry Google DeepMind. Correspondence to: Luyang Liu <EMAIL>, Arthur Szlam <EMAIL>.
Pseudocode No The paper describes methods and processes but does not include any explicitly labeled pseudocode or algorithm blocks. The figures illustrate architectural concepts rather than step-by-step algorithms.
Open Source Code No The paper does not contain any explicit statements about releasing code, nor does it provide links to any repositories. Phrases like "We release our code" or similar are absent.
Open Datasets Yes We evaluate our method using Gemma-2 (Team Gemma et al., 2024) models pretrained on a diverse dataset mixture. Our augmented Gemma-2 models... are trained on the same 2 trillion token, primarily English dataset used for Gemma-2 pretraining (Team Gemma et al., 2024). We evaluated cache augmentation on a range of public benchmarks spanning natural language understanding and reasoning tasks (Table 2). In this setting, we only call the coprocessor once, at the end of the prompt. Our method consistently improves performance compared to the baseline frozen Gemma-2 2B model, with particularly substantial gains on reasoning-intensive benchmarks. Several tasks, including MMLU, GSM8K, TriviaQA, NQ, and MATH, exhibit a strong correlation between the number of latent embeddings and performance improvement.
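The setup the excerpt describes — a frozen decoder whose kv-cache is extended with latent embeddings emitted by a trained coprocessor, called once at the end of the prompt — can be sketched roughly as follows. This is a minimal illustration, not the paper's architecture: the module shapes, the mean-pooled cache summary, and the MLP coprocessor are all our assumptions (the actual coprocessor is a full transformer operating on the kv-cache).

```python
import torch
import torch.nn as nn

class Coprocessor(nn.Module):
    """Illustrative coprocessor: maps a summary of a frozen LLM's kv-cache
    to `num_latents` embeddings that are appended to that cache.
    (Hypothetical stand-in -- the paper uses a full transformer here.)"""
    def __init__(self, d_model: int, num_latents: int):
        super().__init__()
        self.num_latents = num_latents
        self.d_model = d_model
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, num_latents * d_model),
        )

    def forward(self, kv_cache: torch.Tensor) -> torch.Tensor:
        # kv_cache: [batch, seq_len, d_model]; mean-pooling is a crude
        # stand-in for attending over the full cache.
        pooled = kv_cache.mean(dim=1)               # [batch, d_model]
        latents = self.mlp(pooled)                  # [batch, k * d_model]
        return latents.view(-1, self.num_latents, self.d_model)

# Augment the cache once, at the end of the prompt, then decode as usual.
d_model, num_latents = 64, 8
coproc = Coprocessor(d_model, num_latents)
cache = torch.randn(2, 32, d_model)                 # frozen LLM's cache
augmented = torch.cat([cache, coproc(cache)], dim=1)  # [2, 32 + 8, 64]
```

The decoder then attends over `augmented` exactly as it would over an ordinary cache, which is why no change to the base LLM is needed and why the number of latent embeddings can be varied at evaluation time.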
Dataset Splits No We evaluate this using a proprietary validation dataset (same as the one used in Gemma (Team Gemma et al., 2024))... We used LoRA finetuning (with a rank of 128) on both the baseline model and our augmented model. For the baseline, LoRA was applied directly to the base LLM, while for our augmented model, LoRA was applied specifically to the coprocessor, leaving the base LLM frozen. This improvement is likely attributable to the strong regularization imposed by keeping the base LLM frozen during coprocessor training. This freezing prevents overfitting to the relatively small downstream datasets, allowing the coprocessor to effectively learn task-specific reasoning patterns without disrupting the general knowledge encoded in the pretrained LLM.
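The freezing pattern described here — pretrained weights held fixed while only low-rank LoRA factors on the coprocessor receive gradients — can be sketched as below. The tiny linear "models" and the rank value are placeholders for illustration; the paper reports rank 128 on the coprocessor.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay fixed
        # B starts at zero so the update is initially a no-op.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

# Hypothetical stand-ins for the real models.
base_llm = nn.Linear(16, 16)
for p in base_llm.parameters():
    p.requires_grad = False                  # base LLM is never finetuned

coprocessor = LoRALinear(nn.Linear(16, 16), rank=4)

# Only the coprocessor's LoRA factors would receive gradients.
trainable = sorted(n for n, p in coprocessor.named_parameters()
                   if p.requires_grad)
```

Because only `A` and `B` are trainable, the optimizer touches a small fraction of the parameters, which is the regularization effect the excerpt credits for avoiding overfitting on small downstream datasets.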
Hardware Specification No The paper mentions using "Gemma-2 2B model" and "Gemma-2 9B model" as the base LLMs, but it does not specify any hardware (e.g., GPU, CPU models, or cloud computing platforms) used for training or inference.
Software Dependencies No The paper does not provide specific version numbers for any software, libraries, or frameworks used in their experiments. It describes the methodology and models but omits these details.
Experiment Setup Yes We trained the model for 100,000 steps using a batch size of 1024, packed sequences of length 2048, 16 ahead tokens (NA), and 128 randomly sampled augmentation positions (traces) for all training experiments. Importantly, no task-specific training is performed for any of the experiments; all training is done on the pretraining dataset.
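The reported hyperparameters imply a back-of-the-envelope token budget for coprocessor training; the arithmetic below is our estimate (ignoring packing overhead), not a figure from the paper.

```python
# Hyperparameters as reported in the row above.
config = {
    "train_steps": 100_000,
    "batch_size": 1024,
    "sequence_length": 2048,        # packed sequences
    "ahead_tokens": 16,
    "augmentation_positions": 128,  # randomly sampled per sequence
}

# Rough token budget: steps x batch x sequence length.
tokens_seen = (config["train_steps"]
               * config["batch_size"]
               * config["sequence_length"])
print(f"{tokens_seen / 1e12:.2f}T tokens")  # ~0.21T of the 2T pretraining mixture
```

So coprocessor training consumes on the order of a tenth of the 2-trillion-token Gemma-2 pretraining mixture, consistent with the claim that no task-specific data is needed.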