Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
Authors: Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. ... We implement PPD and conduct some preliminary experiments to provide empirical evidence supporting the efficacy of the method since the theoretical analysis does not fully address the potential overheads that may invalidate the latency gains in practical applications. |
| Researcher Affiliation | Collaboration | Seongjun Yang* EMAIL KRAFTON Gibbeum Lee* EMAIL KRAFTON Jaewoong Cho EMAIL KRAFTON Dimitris Papailiopoulos EMAIL University of Wisconsin-Madison Kangwook Lee EMAIL University of Wisconsin-Madison KRAFTON |
| Pseudocode | Yes | Algorithm 1 Predictive Pipelined Decoding (PPD) Algorithm 2 Predictive Pipelined Decoding (PPD) |
| Open Source Code | Yes | The code for our implementation is available in the supplementary material. |
| Open Datasets | Yes | To assess the potential benefits of our method, we analyze the extent of latency reduction and the associated compute resource costs. Also, we measure the match rate, the probability that the early top-k predictions match the prediction from the final layer, with commonly utilized datasets in NLP such as SQUAD 1.1 (Rajpurkar et al., 2016), WMT EN-FR (Bojar et al., 2015), and CNN/DM (Hermann et al., 2015). |
| Dataset Splits | Yes | We use their respective test datasets for evaluations. To be specific, SQUAD 1.1 Rajpurkar et al. (2016) is a Question Answering dataset that has 10,570 test pairs. WMT15 FR-EN Bojar et al. (2015) is a machine translation dataset that includes 1,500 test pairs of English to French translations. CNN/DM Hermann et al. (2015) is a dataset for text summarization which has 11,487 test pairs. |
| Hardware Specification | Yes | The training process is performed on 8 A100 GPUs, and the hyperparameters can be found in Table 4 in Appendix A. |
| Software Dependencies | No | All experiments are conducted using the Huggingface Transformers library (Wolf et al., 2020). We specifically probe the early prediction at the 15th, 20th, 30th, 35th, and 37th layers to derive the match rate. ... We employ the LLaMA-2 (Touvron et al., 2023b) 13B model and conduct tests on several examples from Summarization (CNN/DM) and Question Answering (SQUAD 1.1) tasks. The paper mentions software libraries and models but does not provide specific version numbers for them, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | The training process is performed on 8 A100 GPUs, and the hyperparameters can be found in Table 4 in Appendix A. Table 4: Hyperparameters used for training a language modeling classifier for an intermediate layer. Number of Epochs: 3; Learning Rate: 0.00002; Batch Size: 128; Optimizer: AdamW; Loss Function: Cross-Entropy; Max Sequence Length: 2048; Warmup Ratio: 0.04; Weight Decay: 0.0 |
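The match rate quoted above — the probability that an intermediate layer's top-k predictions contain the final layer's greedy token — can be computed directly once per-position logits are available from both layers. The sketch below is an illustrative reimplementation on synthetic logits, not the paper's evaluation code; the array shapes and the `match_rate` helper are assumptions for demonstration.

```python
import numpy as np

def match_rate(intermediate_logits, final_logits, k=3):
    """Fraction of positions where the final-layer greedy token
    appears in the intermediate layer's top-k predictions.

    Both inputs are (num_positions, vocab_size) logit arrays.
    This is an illustrative helper, not the authors' code.
    """
    # top-k token ids per position from the intermediate layer
    topk = np.argsort(intermediate_logits, axis=-1)[:, -k:]
    # greedy (argmax) token per position from the final layer
    final = np.argmax(final_logits, axis=-1)
    return float(np.mean([f in t for f, t in zip(final, topk)]))

# Synthetic demo: final logits are a noisy copy of the intermediate ones,
# so the intermediate top-k often covers the final greedy token.
rng = np.random.default_rng(0)
inter = rng.normal(size=(100, 50))                    # 100 positions, vocab 50
final = inter + 0.5 * rng.normal(size=(100, 50))
print(match_rate(inter, final, k=3))
```

In the paper's setting, the intermediate logits would come from the trained language-modeling classifier attached to the probed layer (15th, 20th, 30th, 35th, or 37th), and the final logits from the model's last layer.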
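The compute-latency trade-off that the paper analyzes theoretically can be illustrated with a toy simulation: an early top-k guess at an intermediate layer lets speculative decoding of the next token start before the current forward pass finishes, trading extra compute (the k speculative runs) for lower latency whenever the guess matches. This is a deliberately simplified model with made-up parameters (`L`, `d`, `k`, `p`), not the paper's actual analysis or implementation.

```python
import random

def simulate_ppd(num_tokens, L=40, d=20, k=3, p=0.8, seed=0):
    """Toy model of PPD's compute-latency trade-off.

    Baseline decoding spends L layer-steps of latency per token.
    PPD makes an early top-k guess at layer d; with probability p
    (the match rate) the guess contains the true next token, so the
    speculative pass overlaps the remaining (L - d) steps and only d
    steps are added to latency. Compute counts every layer-step run,
    including the k speculative head starts. All numbers illustrative.
    """
    random.seed(seed)
    latency = 0
    compute = 0
    for _ in range(num_tokens):
        compute += L            # the main forward pass
        compute += k * (L - d)  # k speculative head starts (simplified)
        if random.random() < p:
            latency += d        # overlap hides the remaining L - d steps
        else:
            latency += L        # guess missed: no latency saved
    return latency, compute

print(simulate_ppd(100))
```

Under this toy model, a higher match rate p or an earlier probe layer d lowers latency, while a larger k raises compute without hurting latency — mirroring the qualitative trade-off the paper formalizes.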