Efficient Inference With Model Cascades

Authors: Luzian Lebovitz, Lukas Cavigelli, Michele Magno, Lorenz K. Müller

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In this work we explore the effective design of model cascades, thoroughly evaluate the impact on the accuracy-efficiency trade-off, and provide a reproducible state-of-the-art baseline that is currently missing for related research. We demonstrate that model cascades dominate the ImageNet Pareto front already with 2-model cascades, achieving an average reduction in compute effort at equal accuracy of almost 3.1× above 86% and more than 1.9× between 80% and 86% top-1 accuracy, while 3-model cascades achieve 4.4× above 87% accuracy. We confirm wider applicability and effectiveness of the method on the GLUE benchmark.
Researcher Affiliation Collaboration Luzian Lebovitz (EMAIL), Department of Electrical Engineering & Information Technology, ETH Zurich; Lukas Cavigelli (EMAIL), Computing Systems Lab, Huawei Technologies; Michele Magno (EMAIL), Department of Electrical Engineering & Information Technology, ETH Zurich; Lorenz Müller (EMAIL), Computing Systems Lab, Huawei Technologies
Pseudocode Yes Algorithm 1: Early-exit model cascade with maximum softmax confidence metric and no ensembling.
Require: input tensor X, models {M1, ..., Mn} ordered by increasing cost, thresholds {t1, ..., t(n-1)}, n ≥ 2
for i = 1, ..., n do
    zi = Mi(X)
    pi = softmax(zi)
    if i == n or max(pi) ≥ ti then
        return argmax(pi)    ▷ cascade returns predicted class
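Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' released code; the models are stand-ins for any callables returning class logits (e.g. timm models), ordered by increasing cost.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cascade_predict(x, models, thresholds):
    """Early-exit cascade (Algorithm 1): run models in order of
    increasing cost and return the first prediction whose maximum
    softmax confidence clears the corresponding threshold; the last
    model always answers (no threshold on the final stage)."""
    assert len(models) == len(thresholds) + 1
    for i, model in enumerate(models):
        p = softmax(model(x))
        if i == len(models) - 1 or p.max() >= thresholds[i]:
            return int(np.argmax(p))

# Toy usage: a confident cheap model and a larger fallback model.
small = lambda x: np.array([2.0, 0.1, 0.1])  # max confidence ~0.77
big = lambda x: np.array([0.0, 3.0, 0.0])
cascade_predict(None, [small, big], [0.7])   # early exit at the small model
cascade_predict(None, [small, big], [0.99])  # falls through to the big model
```

Note that only the cheap model runs when its confidence clears the threshold, which is where the paper's compute savings come from.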
Open Source Code Yes We release the code to reproduce our experiments in the supplementary material and use only publicly available pretrained models and datasets.
Open Datasets Yes Most of our experiments are conducted on ImageNet (Russakovsky et al., 2015) due to its significance and the large amount of pretrained models available from PyTorch Image Models (Wightman, 2019). We then test their wider applicability on text classification tasks from the GLUE (Wang et al., 2019) benchmark using models from Hugging Face (Wolf et al., 2020).
Dataset Splits Yes Most of our experiments are conducted on ImageNet (Russakovsky et al., 2015) due to its significance and the large amount of pretrained models available from PyTorch Image Models (Wightman, 2019). We infer the ImageNet validation set for models at the Pareto fronts to confirm their accuracy and obtain logits, from which we compute the prediction correctness and confidence score for every image. To test against validation set overfitting, points of the cascade Pareto front with locally maximal improvement were evaluated on the ImageNet test set.
Hardware Specification Yes To acquire the time cost we use benchmark numbers in inferred samples per second on an RTX 3090 with NHWC data format and automatic mixed precision from Wightman (2019).
Software Dependencies No The paper mentions using PyTorch Image Models (Wightman, 2019), Hugging Face (Wolf et al., 2020), and the fvcore library. While these are specific tools, explicit version numbers for these software components or underlying frameworks like PyTorch itself are not provided.
Experiment Setup Yes Algorithm 1 shows the implementation of the cascading method we found to be most effective. The performance of a cascade can be evaluated on a validation set. First, the logits are obtained for each model with a batched forward pass on the entire validation set. From the logits, the confidence score of each input can be calculated according to the used metric. For 2-model cascades, a k × 3 array is constructed containing the confidence score for the first model and prediction correctness for both models for each of the k examples in the validation set.
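The k × 3 array described above makes cascade evaluation a cheap array operation: given precomputed confidences and correctness flags, any threshold can be scored without re-running the models. A minimal sketch under that assumption (function name and toy values are illustrative, not from the paper's code):

```python
import numpy as np

def evaluate_2model_cascade(conf1, correct1, correct2, threshold):
    """Score a 2-model cascade from the k x 3 validation array
    (confidence of model 1, correctness of models 1 and 2).
    Returns (cascade accuracy, fraction of inputs handled by
    the cheap model alone) at the given confidence threshold."""
    early = conf1 >= threshold                     # examples exiting at model 1
    correct = np.where(early, correct1, correct2)  # whose answer counts per example
    return correct.mean(), early.mean()

# Toy validation statistics for k = 4 examples.
conf1 = np.array([0.9, 0.6, 0.95, 0.5])
correct1 = np.array([1, 0, 1, 0])
correct2 = np.array([1, 1, 0, 1])
evaluate_2model_cascade(conf1, correct1, correct2, 0.8)
```

Sweeping `threshold` over the observed confidence values then traces out the cascade's accuracy-cost curve from which the Pareto front is built.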