FloE: On-the-Fly MoE Inference on Memory-constrained GPU

Authors: Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, FloE achieves a 9.3× compression of parameters per expert in Mixtral-8×7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5×; and delivers a 48.7× inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090, all with only a 4.4%–7.6% average performance degradation. The experimental study on various GPU specs and downstream tasks evidences the efficiency and efficacy of FloE (Section 4).
Researcher Affiliation | Academia | 1The State Key Laboratory of Blockchain and Data Security, Zhejiang University; 2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; 3Paul G. Allen School of Computer Science & Engineering, University of Washington. Correspondence to: Zhongle Xie <EMAIL>, Lidan Shou <EMAIL>.
Pseudocode | Yes | Algorithm 1 Efficient Sparse Kernel
1: Input: hidden states x, threshold t_ij, E_ij = {W_gate_ij, W_down_ij, W_up_ij}
2: v ← x W_up_ij
3: mask ← (|v| > t_ij)
4: x ← SiLU(x W_gate_ij[mask]) ⊙ v[mask]
5: y ← W_down_ij[mask] x
6: Return: y
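The masked expert computation in Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the pseudocode, not the paper's kernel (which is implemented in Triton); the tensor shapes, the `silu` helper, and the argument names are assumptions:

```python
import numpy as np

def silu(z):
    # SiLU(z) = z * sigmoid(z), the activation used in Mixtral experts
    return z / (1.0 + np.exp(-z))

def sparse_expert_forward(x, w_gate, w_up, w_down, threshold):
    """Sketch of Algorithm 1: evaluate one expert using only the
    intermediate channels whose up-projection magnitude exceeds t_ij.
    Assumed shapes: x (d_model,), w_gate/w_up (d_model, d_ff),
    w_down (d_ff, d_model)."""
    v = x @ w_up                                   # line 2: v <- x W_up
    mask = np.abs(v) > threshold                   # line 3: keep salient channels
    h = silu(x @ w_gate[:, mask]) * v[mask]        # line 4: gated activation, masked
    y = h @ w_down[mask, :]                        # line 5: down-project kept rows only
    return y
```

With threshold set below all |v| values (sparsity = 0), this reduces to the dense expert forward pass, matching the paper's claim that the kernel generalizes the dense baseline.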
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Mixtral-8×7B model on the C4 dataset (Raffel et al., 2019); perplexity on WikiText-2 (Merity et al., 2016); randomly sample 100 sequences of length 256 from ShareGPT (ShareGPT, 2023); seven downstream tasks using the EleutherAI LM Harness (Gao et al., 2024).
Dataset Splits | No | The paper mentions using well-known datasets and specific evaluation setups such as "zero-shot" and "5-shot MMLU", and for some tests it states "randomly sample 100 sequences of length 256 from ShareGPT". However, it does not provide specific train/test/validation split percentages, absolute sample counts, or explicit references to predefined splits that would be needed for reproduction.
Hardware Specification | Yes | FloE achieves a 48.7× inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090. We use a GeForce RTX 3090 with 24GB VRAM to evaluate end-to-end latency on ShareGPT (ShareGPT, 2023) prompts. The system is also equipped with a 64-core CPU at 2.3GHz and 256GB DRAM interconnected via PCIe 4.0. For the single-expert latency test, we use the C4 dataset (Raffel et al., 2019) and employ four types of GPUs: H100, A100, A6000, and GeForce RTX 3090.
Software Dependencies | No | The paper mentions "Triton (Tillet et al., 2019)" and "PyTorch (Paszke et al., 2019)" but does not provide specific version numbers for these or any other software dependencies used in the experiments.
Experiment Setup | Yes | Empirically, FloE achieves a 9.3× compression of parameters per expert in Mixtral-8×7B. We ran 80 warm-up iterations and 200 timed trials to measure execution latency. Our kernel consistently outperforms the dense baseline (sparsity = 0). We select the average output tokens per second (TPS) as the measurement, using 500 tokens from the C4 dataset; the optimal chunk size in our setup is 50. Specifically, contextual activation sparsity S(·) is applied to the gate projection W_gate_ij and down projection W_down_ij to produce W_S(gate)_ij and W_S(down)_ij, while ultra-low-bit quantization Q(·) (INT2) is used for the up projection W_up_ij to yield W_Q(up)_ij. We set thresholds for the outputs of the SiLU activation function, W_up, and the inputs to W_down at various sparsity levels, then measured text perplexity on WikiText-2 (Merity et al., 2016).
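The compression recipe quoted above pairs threshold-based contextual sparsity with INT2 quantization of the up projection. A minimal sketch of both ingredients follows; the quantile-based threshold calibration, the group size of 64, and the symmetric round-to-nearest scheme are assumptions for illustration, not the paper's exact S(·)/Q(·) procedures:

```python
import numpy as np

def calibrate_threshold(abs_activations, target_sparsity):
    """Pick a threshold t_ij so that `target_sparsity` of the calibration
    activations fall below it (one plausible way to realize the paper's
    sweep over sparsity levels; the exact calibration is an assumption)."""
    return np.quantile(abs_activations, target_sparsity)

def quantize_int2(w, group_size=64):
    """Illustrative symmetric per-group INT2 quantization of a weight
    matrix: 4 levels {-2, -1, 0, 1} scaled by the group's absmax / 2.
    Returns integer codes and per-group scales."""
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 2.0 + 1e-12
    q = np.clip(np.round(flat / scale), -2, 1).astype(np.int8)
    return q, scale  # dequantize with (q * scale).reshape(w.shape)
```

Under this sketch, a 90% target sparsity keeps roughly one in ten intermediate channels, and the INT2 codes need only 2 bits per weight plus one scale per group, consistent in spirit with the ~9.3× per-expert compression the paper reports.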