Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Authors: Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single-batch inference or long prefill, Fiddler performs better in all scenarios. Compared against different baselines, Fiddler achieves a 1.26x speedup in single-batch inference, 1.30x in long prefill processing, and 11.57x in beam search inference.
Researcher Affiliation Academia 1University of Washington, 2Tsinghua University
Pseudocode Yes Algorithm 1 Expert Execution Strategy
Open Source Code Yes The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.
Open Datasets Yes We use ShareGPT (ShareGPT), a dataset of conversations between humans and chatbots, to model the realistic behavior of expert selection. We pick the subset of conversations randomly. We implement Fiddler on top of PyTorch (Paszke et al., 2019). Additionally, we give a sensitivity study on different datasets in Appendix D to show the effectiveness of Fiddler in a wider variety of routing behaviors. ... ShareGPT. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered ... LMSYS-Chat-1M dataset (Zheng et al., 2024)
Dataset Splits No The paper describes how input tokens are selected for evaluation scenarios (e.g., "For the evaluation with N input tokens, we randomly select samples from ShareGPT with N tokens or more of prompt and use the initial N tokens."). It specifies input and output lengths for the different inference scenarios (e.g., "the input length is among [32, 64, 128, 256], and the output length is among [64, 128, 256, 512]"), as well as beam search widths. However, it does not provide training/test/validation splits for the underlying Mixtral-8x7B model, nor any splitting methodology in terms of percentages or counts for reproducing a data partition in the traditional sense.
Hardware Specification Yes Table 1: Evaluation setups. Environment 1: GPU Quadro RTX 6000 (NVIDIA, b), CPU Intel(R) Xeon(R) Gold 6126 (48 cores). Environment 2: GPU RTX 6000 Ada (NVIDIA, a), CPU Intel Xeon Platinum 8480+ (112 cores).
Software Dependencies No The paper mentions "We implement Fiddler on top of PyTorch (Paszke et al., 2019)" without a specific version number for PyTorch. It lists version numbers for baselines, such as "DeepSpeed-MII version v0.2.3" and "llama.cpp version b2956", but these are not for Fiddler's own implementation.
Experiment Setup Yes We evaluate Fiddler with the uncompressed (16-bit) Mixtral-8x7B model, which has over 90GB of parameters, on two environments with a single GPU each. ... For Mixtral-Offloading, we set the offload_per_layer parameter to 7 for Environment 1 and 5 for Environment 2, as this is the best configuration for the environments we test. For llama.cpp, we set the ngl parameter, which controls the number of layers executed on the GPU, to 8 for Environment 1 and 16 for Environment 2. ... For (a), the input length is among [32, 64, 128, 256], and the output length is among [64, 128, 256, 512]. The input length for (b) is among [512, 1024, 2048, 4096]. We set the beam search width for (c) to be among [4, 8, 12, 16], with an input length of 32 and an output length of 64.
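The paper's input-selection procedure (randomly sampling ShareGPT prompts with at least N tokens and truncating to the first N) can be sketched as follows. This is a minimal illustration, not the authors' code: the corpus here is a toy stand-in for ShareGPT, and whitespace splitting stands in for the model tokenizer the paper would use.

```python
import random

def select_prefixes(prompts, n_tokens, num_samples, seed=0):
    """Randomly pick prompts with at least n_tokens tokens; keep the first n_tokens.

    Whitespace tokenization is a simplifying assumption; the paper would use
    the Mixtral tokenizer to count and truncate tokens.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    eligible = [p for p in prompts if len(p.split()) >= n_tokens]
    chosen = rng.sample(eligible, min(num_samples, len(eligible)))
    return [" ".join(p.split()[:n_tokens]) for p in chosen]

# Toy stand-in corpus (the paper samples real ShareGPT conversations).
corpus = [
    "hello how are you doing today my friend",
    "short prompt",
    "the quick brown fox jumps over the lazy dog again and again",
]
prefixes = select_prefixes(corpus, n_tokens=5, num_samples=2)
```

Each returned prefix has exactly N tokens, matching the paper's description of building fixed-length inputs for the [32, 64, 128, 256] input-length grid.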