Agreement-Based Cascading for Efficient Inference
Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate ABC on a wide range of image and language tasks and find that ABC not only improves efficiency, but also accuracy, compared to the model that it aims to replace. We then consider the performance of ABC relative to existing cascading methods in common inference scenarios, including (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14×; (2) model serving on heterogeneous GPUs, where ABC reduces rental costs by up to 3×; and (3) inference using black-box access to model API services, where ABC shows up to a 25% reduction in average price per token. |
| Researcher Affiliation | Academia | Steven Kolawole EMAIL Carnegie Mellon University Don Dennis EMAIL Carnegie Mellon University Ameet Talwalkar EMAIL Carnegie Mellon University Virginia Smith EMAIL Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 Agreement-Based Cascading (ABC). Require: set of ensembles {H_1, H_2, …, H_{n_E}} and a deferral rule r_i for each ensemble i ∈ [n_E] as in Equation 3 or 4. Require: a new inference data point x. 1: current cascade level i ← 1; 2: cascaded prediction ŷ ← ∅; 3: for i ∈ {1, …, n_E} do; 4: ŷ ← H_i(x); 5: if r_i(x) = 0 then; 6: break {models in ensemble agree}; 7: end if; 8: end for; 9: return ŷ |
| Open Source Code | No | Private repo. TorchVision, Hugging Face, OpenCLIP, Pareto frontier |
| Open Datasets | Yes | Datasets: To evaluate ABC, we use a range of benchmark datasets for image and language tasks, as shown in Table 2 in the Appendix. Additional datasets are used in 5.2.3 to align with those explored by state-of-the-art baselines. |
| Dataset Splits | No | The paper mentions using a 'small subset of samples from the validation set' (around 100 samples) for threshold estimation, and total sample counts for some datasets (e.g., CIFAR-10: 10,000; ImageNet-1K: 50,000), but does not provide specific training/validation/test splits or reference standard splits for all experiments. |
| Hardware Specification | Yes | For instance, based on the current pricing model offered by Lambda (Lambda, 2024), a popular cloud rental platform, the rental pricing of a single A100 is $1.40/hour and a V100 node is $0.06/hour (γ ≈ 4 · 10⁻²), while the rated 32-bit tensor core throughput is 312 TFLOPS for the A100 and 125 TFLOPS for the V100. In this scenario, a simple placement strategy for a 2-level ABC that reduces inference cost may place the smaller model on V100 nodes and larger models on A100 nodes. |
| Software Dependencies | No | The paper mentions using models from the Hugging Face model zoo and refers to TorchVision and DistilBERT, but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiments. |
| Experiment Setup | Yes | Estimating Voting Threshold: ABC’s deferral rule uses a configurable voting threshold, θ (see Equations 3 and 4) at each cascading tier. We estimate θ empirically on a small set of unseen data; see App. B for details. |
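The pseudocode quoted in the table can be sketched in Python. This is an illustrative reconstruction, not the authors' code (which is private): the deferral rule is assumed to accept when the fraction of ensemble members agreeing on the majority label reaches a per-level threshold `theta`, which is one plausible reading of Equations 3 and 4; `abc_predict` and all argument names are hypothetical.

```python
from collections import Counter

def abc_predict(ensembles, thetas, x):
    """Agreement-Based Cascading: defer x to the next (larger)
    ensemble whenever the current ensemble's agreement on its
    majority prediction falls below that level's threshold."""
    y = None
    for models, theta in zip(ensembles, thetas):
        preds = [m(x) for m in models]                  # H_i(x): per-member predictions
        label, votes = Counter(preds).most_common(1)[0]  # majority label and its count
        y = label                                        # current cascaded prediction
        if votes / len(preds) >= theta:                  # r_i(x) = 0: members agree
            break                                        # accept; stop cascading
    return y                                             # last level answers if all defer
```

With `thetas = [1.0, 0.5]`, an input on which the first (cheap) ensemble splits 2-to-1 is escalated to the second level, while unanimous inputs never leave level one.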
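The price ratio γ quoted in the hardware row can be sanity-checked with one line of arithmetic: $0.06/hour for the V100 against $1.40/hour for the A100 does come out near 4 · 10⁻².

```python
# Ratio of V100 to A100 hourly rental price as quoted in the paper.
gamma = 0.06 / 1.40
print(round(gamma, 3))  # ≈ 0.043, i.e. γ ≈ 4 · 10⁻²
```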
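The experiment-setup row says θ is estimated empirically on a small set of unseen data, with details deferred to the paper's Appendix B. A minimal sketch of one plausible procedure, assuming a grid search that picks the smallest threshold whose accepted predictions meet a target accuracy on ~100 held-out samples (`estimate_theta`, `target_acc`, and the grid are illustrative assumptions, not the paper's method):

```python
from collections import Counter

def estimate_theta(models, samples, labels, target_acc=0.95,
                   grid=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Pick the smallest theta whose *accepted* predictions
    (agreement fraction >= theta) reach the target accuracy."""
    for theta in grid:
        correct = accepted = 0
        for x, y_true in zip(samples, labels):
            preds = [m(x) for m in models]
            label, votes = Counter(preds).most_common(1)[0]
            if votes / len(preds) >= theta:      # would be accepted at this level
                accepted += 1
                correct += (label == y_true)
        if accepted and correct / accepted >= target_acc:
            return theta
    return grid[-1]  # fall back to the strictest threshold
```

A smaller θ accepts more inputs at the cheap level (lower cost); a larger θ defers more aggressively (higher accuracy), which is the trade-off the threshold controls.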