Agreement-Based Cascading for Efficient Inference
Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate ABC on a wide range of image and language tasks and find that ABC not only improves efficiency, but also accuracy, compared to the model that it aims to replace. We then consider the performance of ABC relative to existing cascading methods in common inference scenarios, including (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14×; (2) model serving on heterogeneous GPUs, where ABC reduces rental costs by up to 3×; and (3) inference using black-box access to model API services, where ABC shows up to a 25% reduction in average price per token. |
| Researcher Affiliation | Academia | Steven Kolawole EMAIL Carnegie Mellon University Don Dennis EMAIL Carnegie Mellon University Ameet Talwalkar EMAIL Carnegie Mellon University Virginia Smith EMAIL Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 Agreement-Based Cascading (ABC). Require: set of ensembles {H_1, H_2, …, H_{n_E}} and a deferral rule r_i for each ensemble i ∈ [n_E] as in Equation 3 or 4. Require: a new inference data point x. 1: current cascade level i ← 1; 2: cascaded prediction ŷ ← ∅; 3: for i ∈ {1, …, n_E} do; 4: ŷ ← H_i(x); 5: if r_i(x) = 0 then; 6: break {models in ensemble agree}; 7: end if; 8: end for; 9: return ŷ |
| Open Source Code | No | Private repo. TorchVision, Hugging Face, OpenCLIP, Pareto frontier |
| Open Datasets | Yes | Datasets: To evaluate ABC, we use a range of benchmark datasets for image and language tasks, as shown in Table 2 in the Appendix. Additional datasets are used in 5.2.3 to align with those explored by state-of-the-art baselines. |
| Dataset Splits | No | The paper mentions using a 'small subset of samples from the validation set' (around 100 samples) for threshold estimation, and total sample counts for some datasets (e.g., CIFAR-10: 10,000; ImageNet-1K: 50,000), but does not provide specific training/validation/test splits or reference standard splits for all experiments. |
| Hardware Specification | Yes | For instance, based on the current pricing model offered by Lambda (Lambda, 2024), a popular cloud rental platform, the rental pricing of a single A100 is $1.40/hour and a V100 node is $0.06/hour (γ ≈ 4 · 10⁻²), while the rated 32-bit tensor core throughput is 312 TFLOPS for the A100 and 125 TFLOPS for the V100. In this scenario, a simple placement strategy for a 2-level ABC that reduces inference cost may place the smaller model on V100 nodes and larger models on A100 nodes. |
| Software Dependencies | No | The paper mentions using models from the Hugging Face model zoo and refers to TorchVision and DistilBERT, but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiments. |
| Experiment Setup | Yes | Estimating Voting Threshold: ABC’s deferral rule uses a configurable voting threshold, θ (see Equations 3 and 4) at each cascading tier. We estimate θ empirically on a small set of unseen data; see App. B for details. |
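The pseudocode quoted in the table can be sketched in Python. This is an illustrative reconstruction, not the authors' code (which is private): the deferral rule is assumed to accept when the fraction of ensemble members agreeing on the majority label reaches a per-level threshold `theta`, which is one plausible reading of Equations 3 and 4; `abc_predict` and all argument names are hypothetical.

```python
from collections import Counter

def abc_predict(ensembles, thetas, x):
    """Agreement-Based Cascading: defer x to the next (larger)
    ensemble whenever the current ensemble's agreement on its
    majority prediction falls below that level's threshold."""
    y = None
    for models, theta in zip(ensembles, thetas):
        preds = [m(x) for m in models]                  # H_i(x): per-member predictions
        label, votes = Counter(preds).most_common(1)[0]  # majority label and its count
        y = label                                        # current cascaded prediction
        if votes / len(preds) >= theta:                  # r_i(x) = 0: members agree
            break                                        # accept; stop cascading
    return y                                             # last level answers if all defer
```

With `thetas = [1.0, 0.5]`, an input on which the first (cheap) ensemble splits 2-to-1 is escalated to the second level, while unanimous inputs never leave level one.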
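The price ratio γ quoted in the hardware row can be sanity-checked with one line of arithmetic: $0.06/hour for the V100 against $1.40/hour for the A100 does come out near 4 · 10⁻².

```python
# Ratio of V100 to A100 hourly rental price as quoted in the paper.
gamma = 0.06 / 1.40
print(round(gamma, 3))  # ≈ 0.043, i.e. γ ≈ 4 · 10⁻²
```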
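The experiment-setup row says θ is estimated empirically on a small set of unseen data, with details deferred to the paper's Appendix B. A minimal sketch of one plausible procedure, assuming a grid search that picks the smallest threshold whose accepted predictions meet a target accuracy on ~100 held-out samples (`estimate_theta`, `target_acc`, and the grid are illustrative assumptions, not the paper's method):

```python
from collections import Counter

def estimate_theta(models, samples, labels, target_acc=0.95,
                   grid=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Pick the smallest theta whose *accepted* predictions
    (agreement fraction >= theta) reach the target accuracy."""
    for theta in grid:
        correct = accepted = 0
        for x, y_true in zip(samples, labels):
            preds = [m(x) for m in models]
            label, votes = Counter(preds).most_common(1)[0]
            if votes / len(preds) >= theta:      # would be accepted at this level
                accepted += 1
                correct += (label == y_true)
        if accepted and correct / accepted >= target_acc:
            return theta
    return grid[-1]  # fall back to the strictest threshold
```

A smaller θ accepts more inputs at the cheap level (lower cost); a larger θ defers more aggressively (higher accuracy), which is the trade-off the threshold controls.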