Cost-efficient Collaboration between On-device and Cloud Language Models

Authors: Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Ré

ICML 2025

Reproducibility Variable | Result | LLM Response
------------------------ | ------ | ------------
Research Type | Experimental | "MINIONS reduces costs by 5.7× on average while recovering 97.9% of the remote-only performance. Our analysis reveals several key design choices that influence the tradeoff between cost and performance in local-remote systems. We evaluate MINIONS on three benchmarks that are well suited for data-intensive reasoning: FINANCEBENCH, LONGHEALTH, and QASPER."
Researcher Affiliation | Collaboration | "1Department of Computer Science, Stanford University; 2Department of Statistics, Stanford University; 3Together AI; 4Department of Biomedical Data Science, Stanford University. Correspondence to: Sabri Eyuboglu <EMAIL>."
Pseudocode | Yes | def prepare_jobs(context: List[str], prev_job_manifests: Optional[List[JobManifest]] = None, prev_job_outputs: Optional[List[JobOutput]] = None) -> List[JobManifest]:
Open Source Code | No | No explicit statement about code release or a link to a repository is provided in the paper.
Open Datasets | Yes | "We evaluate MINIONS on three benchmarks that are well suited for data-intensive reasoning: FINANCEBENCH (Islam et al., 2023), LONGHEALTH (Adams et al., 2024), and QASPER (Dasigi et al., 2021)."
Dataset Splits | Yes | "For all ablations in Section 6, we use a fixed subset of 128 problems. We train on 317 questions and test on 17 held-out questions."
Hardware Specification | Yes | "For these experiments, the Local LM is running on a single consumer-grade GPU (e.g. RTX 4090, MSRP $1,599). We run our local models on A100 GPUs."
Software Dependencies | No | The paper mentions models like GPT-4O, LLAMA, and QWEN2.5, and tools like Ollama and llama.cpp, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "All local-only and remote-only experiments are run with temperature of 0.2. For all MINIONS experiments run in Table 1, we run the Remote LM with a temperature of 0.0 and Local LM with a temperature of 0.2 for FINANCEBENCH and 0.00001 for QASPER and LONGHEALTH."
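The `prepare_jobs` signature quoted in the Pseudocode row can be fleshed out into a runnable sketch. The `JobManifest`/`JobOutput` dataclass fields and the retry-unanswered-chunks logic here are illustrative assumptions for how a local-remote fan-out round might work, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class JobManifest:
    # Hypothetical fields: which context chunk a local job reads, and its task string.
    chunk_id: int
    task: str

@dataclass
class JobOutput:
    # Hypothetical fields: the manifest the job ran under, and its extracted answer
    # (None when the local model found nothing relevant in the chunk).
    manifest: JobManifest
    answer: Optional[str] = None

def prepare_jobs(
    context: List[str],
    prev_job_manifests: Optional[List[JobManifest]] = None,
    prev_job_outputs: Optional[List[JobOutput]] = None,
) -> List[JobManifest]:
    """First round: fan one job out per context chunk.
    Later rounds: re-issue jobs only for chunks that came back unanswered."""
    if prev_job_outputs is None:
        return [JobManifest(chunk_id=i, task="extract") for i in range(len(context))]
    unanswered = [o.manifest.chunk_id for o in prev_job_outputs if o.answer is None]
    return [JobManifest(chunk_id=i, task="extract") for i in unanswered]
```

On the first call every chunk gets a job; passing the previous round's outputs narrows the next round to only the chunks that returned no answer.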