Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Authors: Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single-batch inference or long prefill, Fiddler performs better in all scenarios. Compared against different baselines, Fiddler achieves a 1.26x speedup in single-batch inference, 1.30x in long prefill processing, and 11.57x in beam search inference.
Researcher Affiliation Academia 1University of Washington, 2Tsinghua University
Pseudocode Yes Algorithm 1 Expert Execution Strategy
Open Source Code Yes The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.
Open Datasets Yes We use ShareGPT (ShareGPT), a dataset of conversations between humans and chatbots, to model the realistic behavior of expert selection. We pick the subset of conversations randomly. We implement Fiddler on top of PyTorch (Paszke et al., 2019). Additionally, we give a sensitivity study on different datasets in Appendix D to show the effectiveness of Fiddler in a wider variety of routing behaviors. ... ShareGPT. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered ... LMSYS-Chat-1M dataset (Zheng et al., 2024)
Dataset Splits No The paper describes how input tokens are selected for evaluation scenarios (e.g., "For the evaluation with N input tokens, we randomly select samples from ShareGPT with N tokens or more of prompt and use the initial N tokens."). It specifies input and output lengths for the different inference scenarios (e.g., "the input length is among [32, 64, 128, 256], and the output length is among [64, 128, 256, 512]"), as well as beam search widths. However, it does not provide training/test/validation splits for the underlying Mixtral-8x7B model, nor any splitting methodology in terms of percentages or counts for reproducing a data partition in the traditional sense.
Hardware Specification Yes Table 1: Evaluation setups. Environment 1: GPU Quadro RTX 6000 (NVIDIA, b), CPU Intel(R) Xeon(R) Gold 6126 (48 cores). Environment 2: GPU RTX 6000 Ada (NVIDIA, a), CPU Intel Xeon Platinum 8480+ (112 cores).
Software Dependencies No The paper mentions "We implement Fiddler on top of PyTorch (Paszke et al., 2019)" without a specific version number for PyTorch. It lists version numbers for baselines, such as "DeepSpeed-MII version v0.2.3" and "llama.cpp version b2956", but these are not for Fiddler's own implementation.
Experiment Setup Yes We evaluate Fiddler with the uncompressed (16-bit) Mixtral-8x7B model, which has over 90GB of parameters, on two environments with a single GPU each. ... For Mixtral-Offloading, we set the offload_per_layer parameter to 7 for Environment 1 and 5 for Environment 2, as this is the best configuration for the environments we test. For llama.cpp, we set the ngl parameter, which controls the number of layers executed on the GPU, to 8 for Environment 1 and 16 for Environment 2. ... For (a), the input length is among [32, 64, 128, 256], and the output length is among [64, 128, 256, 512]. The input length for (b) is among [512, 1024, 2048, 4096]. We set the beam search width for (c) to be among [4, 8, 12, 16], with an input length of 32 and an output length of 64.
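The paper's input-selection procedure (randomly sampling ShareGPT prompts with at least N tokens and truncating to the first N) can be sketched as follows. This is a minimal illustration, not the authors' code: the corpus here is a toy stand-in for ShareGPT, and whitespace splitting stands in for the model tokenizer the paper would use.

```python
import random

def select_prefixes(prompts, n_tokens, num_samples, seed=0):
    """Randomly pick prompts with at least n_tokens tokens; keep the first n_tokens.

    Whitespace tokenization is a simplifying assumption; the paper would use
    the Mixtral tokenizer to count and truncate tokens.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    eligible = [p for p in prompts if len(p.split()) >= n_tokens]
    chosen = rng.sample(eligible, min(num_samples, len(eligible)))
    return [" ".join(p.split()[:n_tokens]) for p in chosen]

# Toy stand-in corpus (the paper samples real ShareGPT conversations).
corpus = [
    "hello how are you doing today my friend",
    "short prompt",
    "the quick brown fox jumps over the lazy dog again and again",
]
prefixes = select_prefixes(corpus, n_tokens=5, num_samples=2)
```

Each returned prefix has exactly N tokens, matching the paper's description of building fixed-length inputs for the [32, 64, 128, 256] input-length grid.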