Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
Authors: Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. ... We implement PPD and conduct some preliminary experiments to provide empirical evidence supporting the efficacy of the method since the theoretical analysis does not fully address the potential overheads that may invalidate the latency gains in practical applications. |
| Researcher Affiliation | Collaboration | Seongjun Yang* EMAIL KRAFTON Gibbeum Lee* EMAIL KRAFTON Jaewoong Cho EMAIL KRAFTON Dimitris Papailiopoulos EMAIL University of Wisconsin-Madison Kangwook Lee EMAIL University of Wisconsin-Madison KRAFTON |
| Pseudocode | Yes | Algorithm 1 Predictive Pipelined Decoding (PPD) Algorithm 2 Predictive Pipelined Decoding (PPD) |
| Open Source Code | Yes | The code for our implementation is available in the supplementary material. |
| Open Datasets | Yes | To assess the potential benefits of our method, we analyze the extent of latency reduction and the associated compute resource costs. Also, we measure the match rate, the probability that the early top-k predictions match the prediction from the final layer, with commonly utilized datasets in NLP such as SQUAD 1.1 (Rajpurkar et al., 2016), WMT EN-FR (Bojar et al., 2015), and CNN/DM (Hermann et al., 2015). |
| Dataset Splits | Yes | We use their respective test datasets for evaluations. To be specific, SQUAD 1.1 Rajpurkar et al. (2016) is a Question Answering dataset that has 10,570 test pairs. WMT15 FR-EN Bojar et al. (2015) is a machine translation dataset that includes 1,500 test pairs of English to French translations. CNN/DM Hermann et al. (2015) is a dataset for text summarization which has 11,487 test pairs. |
| Hardware Specification | Yes | The training process is performed on 8 A100 GPUs, and the hyperparameters can be found in Table 4 in Appendix A. |
| Software Dependencies | No | All experiments are conducted using the Huggingface Transformers library (Wolf et al., 2020). We specifically probe the early prediction at the 15th, 20th, 30th, 35th, and 37th layers to derive the match rate. ... We employ the LLaMA-2 (Touvron et al., 2023b) 13B model and conduct tests on several examples from Summarization (CNN/DM) and Question Answering (SQUAD 1.1) tasks. The paper mentions software libraries and models but does not provide specific version numbers for them, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | The training process is performed on 8 A100 GPUs, and the hyperparameters can be found in Table 4 in Appendix A. Table 4: Hyperparameters used for training a language modeling classifier for an intermediate layer. Number of Epochs: 3; Learning Rate: 0.00002; Batch Size: 128; Optimizer: AdamW; Loss Function: Cross-Entropy; Max Sequence Length: 2048; Warmup Ratio: 0.04; Weight Decay: 0.0 |
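The match rate quoted above — the probability that an intermediate layer's top-k predictions contain the final layer's greedy token — can be computed directly once per-position logits are available from both layers. The sketch below is an illustrative reimplementation on synthetic logits, not the paper's evaluation code; the array shapes and the `match_rate` helper are assumptions for demonstration.

```python
import numpy as np

def match_rate(intermediate_logits, final_logits, k=3):
    """Fraction of positions where the final-layer greedy token
    appears in the intermediate layer's top-k predictions.

    Both inputs are (num_positions, vocab_size) logit arrays.
    This is an illustrative helper, not the authors' code.
    """
    # top-k token ids per position from the intermediate layer
    topk = np.argsort(intermediate_logits, axis=-1)[:, -k:]
    # greedy (argmax) token per position from the final layer
    final = np.argmax(final_logits, axis=-1)
    return float(np.mean([f in t for f, t in zip(final, topk)]))

# Synthetic demo: final logits are a noisy copy of the intermediate ones,
# so the intermediate top-k often covers the final greedy token.
rng = np.random.default_rng(0)
inter = rng.normal(size=(100, 50))                    # 100 positions, vocab 50
final = inter + 0.5 * rng.normal(size=(100, 50))
print(match_rate(inter, final, k=3))
```

In the paper's setting, the intermediate logits would come from the trained language-modeling classifier attached to the probed layer (15th, 20th, 30th, 35th, or 37th), and the final logits from the model's last layer.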
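The compute-latency trade-off that the paper analyzes theoretically can be illustrated with a toy simulation: an early top-k guess at an intermediate layer lets speculative decoding of the next token start before the current forward pass finishes, trading extra compute (the k speculative runs) for lower latency whenever the guess matches. This is a deliberately simplified model with made-up parameters (`L`, `d`, `k`, `p`), not the paper's actual analysis or implementation.

```python
import random

def simulate_ppd(num_tokens, L=40, d=20, k=3, p=0.8, seed=0):
    """Toy model of PPD's compute-latency trade-off.

    Baseline decoding spends L layer-steps of latency per token.
    PPD makes an early top-k guess at layer d; with probability p
    (the match rate) the guess contains the true next token, so the
    speculative pass overlaps the remaining (L - d) steps and only d
    steps are added to latency. Compute counts every layer-step run,
    including the k speculative head starts. All numbers illustrative.
    """
    random.seed(seed)
    latency = 0
    compute = 0
    for _ in range(num_tokens):
        compute += L            # the main forward pass
        compute += k * (L - d)  # k speculative head starts (simplified)
        if random.random() < p:
            latency += d        # overlap hides the remaining L - d steps
        else:
            latency += L        # guess missed: no latency saved
    return latency, compute

print(simulate_ppd(100))
```

Under this toy model, a higher match rate p or an earlier probe layer d lowers latency, while a larger k raises compute without hurting latency — mirroring the qualitative trade-off the paper formalizes.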