Cost-efficient Collaboration between On-device and Cloud Language Models

Authors: Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Ré

ICML 2025

Reproducibility Variable | Result | LLM Response
------------------------ | ------ | ------------
Research Type | Experimental | "MINIONS reduces costs by 5.7× on average while recovering 97.9% of the remote-only performance. Our analysis reveals several key design choices that influence the tradeoff between cost and performance in local-remote systems. We evaluate MINIONS on three benchmarks that are well suited for data-intensive reasoning: FINANCEBENCH, LONGHEALTH, and QASPER."
Researcher Affiliation | Collaboration | "1Department of Computer Science, Stanford University; 2Department of Statistics, Stanford University; 3Together AI; 4Department of Biomedical Data Science, Stanford University. Correspondence to: Sabri Eyuboglu <EMAIL>."
Pseudocode | Yes | def prepare_jobs(context: List[str], prev_job_manifests: Optional[List[JobManifest]] = None, prev_job_outputs: Optional[List[JobOutput]] = None) -> List[JobManifest]:
Open Source Code | No | No explicit statement about code release or a link to a repository is provided in the paper.
Open Datasets | Yes | "We evaluate MINIONS on three benchmarks that are well suited for data-intensive reasoning: FINANCEBENCH (Islam et al., 2023), LONGHEALTH (Adams et al., 2024), and QASPER (Dasigi et al., 2021)."
Dataset Splits | Yes | "For all ablations in Section 6, we use a fixed subset of 128 problems. We train on 317 questions and test on 17 held-out questions."
Hardware Specification | Yes | "For these experiments, the Local LM is running on a single consumer-grade GPU (e.g. RTX 4090, MSRP $1,599). We run our local models on A100 GPUs."
Software Dependencies | No | The paper mentions models like GPT-4O, LLAMA, and QWEN2.5, and tools like Ollama and llama.cpp, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "All local-only and remote-only experiments are run with temperature of 0.2. For all MINIONS experiments run in Table 1, we run the Remote LM with a temperature of 0.0 and Local LM with a temperature of 0.2 for FINANCEBENCH and 0.00001 for QASPER and LONGHEALTH."
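The `prepare_jobs` signature quoted in the Pseudocode row can be fleshed out into a runnable sketch. The `JobManifest`/`JobOutput` dataclass fields and the retry-unanswered-chunks logic here are illustrative assumptions for how a local-remote fan-out round might work, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class JobManifest:
    # Hypothetical fields: which context chunk a local job reads, and its task string.
    chunk_id: int
    task: str

@dataclass
class JobOutput:
    # Hypothetical fields: the manifest the job ran under, and its extracted answer
    # (None when the local model found nothing relevant in the chunk).
    manifest: JobManifest
    answer: Optional[str] = None

def prepare_jobs(
    context: List[str],
    prev_job_manifests: Optional[List[JobManifest]] = None,
    prev_job_outputs: Optional[List[JobOutput]] = None,
) -> List[JobManifest]:
    """First round: fan one job out per context chunk.
    Later rounds: re-issue jobs only for chunks that came back unanswered."""
    if prev_job_outputs is None:
        return [JobManifest(chunk_id=i, task="extract") for i in range(len(context))]
    unanswered = [o.manifest.chunk_id for o in prev_job_outputs if o.answer is None]
    return [JobManifest(chunk_id=i, task="extract") for i in unanswered]
```

On the first call every chunk gets a job; passing the previous round's outputs narrows the next round to only the chunks that returned no answer.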