Efficient Inference With Model Cascades
Authors: Luzian Lebovitz, Lukas Cavigelli, Michele Magno, Lorenz K. Müller
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we explore the effective design of model cascades, thoroughly evaluate the impact on the accuracy-efficiency trade-off, and provide a reproducible state-of-the-art baseline that is currently missing for related research. We demonstrate that model cascades dominate the ImageNet Pareto front already with 2-model cascades, achieving an average reduction in compute effort at equal accuracy of almost 3.1× above 86% and more than 1.9× between 80% and 86% top-1 accuracy, while 3-model cascades achieve 4.4× above 87% accuracy. We confirm wider applicability and effectiveness of the method on the GLUE benchmark. |
| Researcher Affiliation | Collaboration | Luzian Lebovitz (EMAIL), Department of Electrical Engineering & Information Technology, ETH Zurich; Lukas Cavigelli (EMAIL), Computing Systems Lab, Huawei Technologies; Michele Magno (EMAIL), Department of Electrical Engineering & Information Technology, ETH Zurich; Lorenz Müller (EMAIL), Computing Systems Lab, Huawei Technologies |
| Pseudocode | Yes | Algorithm 1: Early-exit model cascade with maximum softmax confidence metric and no ensembling. Require: input tensor X, models {M1, ..., Mn} ordered by increasing cost, thresholds {t1, ..., t(n-1)}, n ≥ 2. for i = 1, ..., n do: zi = Mi(X); pi = softmax(zi); if i == n or max(pi) ≥ ti then return arg max(pi) — cascade returns predicted class |
| Open Source Code | Yes | We release the code to reproduce our experiments in the supplementary material and use only publicly available pretrained models and datasets. |
| Open Datasets | Yes | Most of our experiments are conducted on ImageNet (Russakovsky et al., 2015) due to its significance and the large amount of pretrained models available from PyTorch Image Models (Wightman, 2019). We then test their wider applicability on text classification tasks from the GLUE (Wang et al., 2019) benchmark using models from Hugging Face (Wolf et al., 2020). |
| Dataset Splits | Yes | Most of our experiments are conducted on ImageNet (Russakovsky et al., 2015) due to its significance and the large amount of pretrained models available from PyTorch Image Models (Wightman, 2019). We infer the ImageNet validation set for models at the Pareto fronts to confirm their accuracy and obtain logits, from which we compute the prediction correctness and confidence score for every image. To test against validation set overfitting, points of the cascade Pareto front with locally maximal improvement were evaluated on the ImageNet test set. |
| Hardware Specification | Yes | To acquire the time cost we use benchmark numbers in inferred samples per second on an RTX 3090 with NHWC data format and automatic mixed precision from Wightman (2019). |
| Software Dependencies | No | The paper mentions using PyTorch Image Models (Wightman, 2019), Hugging Face (Wolf et al., 2020), and the fvcore library. While these are specific tools, explicit version numbers for these software components or underlying frameworks like PyTorch itself are not provided. |
| Experiment Setup | Yes | Algorithm 1 shows the implementation of the cascading method we found to be most effective. The performance of a cascade can be evaluated on a validation set. First, the logits are obtained for each model with a batched forward pass on the entire validation set. From the logits, the confidence score of each input can be calculated according to the used metric. For 2-model cascades, a k×3 array is constructed containing the confidence score for the first model and prediction correctness for both models for each of the k examples in the validation set. |
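The released code itself is not reproduced on this page. As an illustration of the two pieces quoted above — the early-exit cascade of Algorithm 1 and the per-example k×3 evaluation array for 2-model cascades — a minimal NumPy sketch might look like the following. Function names (`cascade_predict`, `confidence_table`) and the callable-model interface are hypothetical, not taken from the paper's code release.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cascade_predict(x, models, thresholds):
    """Sketch of Algorithm 1: run models in order of increasing cost
    and return as soon as the maximum softmax confidence reaches the
    model's threshold (the last model always returns).

    `models` are callables mapping an input to a logit vector;
    `thresholds` has one entry per model except the last."""
    assert len(models) >= 2 and len(thresholds) == len(models) - 1
    for i, model in enumerate(models):
        p = softmax(model(x))
        if i == len(models) - 1 or p.max() >= thresholds[i]:
            return int(p.argmax())

def confidence_table(logits1, logits2, labels):
    """Sketch of the k×3 array described in the setup: for each of the
    k validation examples, the confidence score of model 1 and the
    prediction correctness of both models."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return np.stack([p1.max(axis=1),
                     p1.argmax(axis=1) == labels,
                     p2.argmax(axis=1) == labels], axis=1)
```

Sweeping the threshold over the first column of this array and reading off correctness from the other two is what lets a 2-model cascade's accuracy/cost trade-off be evaluated from cached logits, without re-running either model.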