FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Authors: Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, FloE achieves a 9.3× compression of parameters per expert in Mixtral-8×7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5×; and delivers a 48.7× inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090, all with only a 4.4%–7.6% average performance degradation. The experimental study on various GPU specs and downstream tasks evidences the efficiency and efficacy of FloE (Section 4). |
| Researcher Affiliation | Academia | 1The State Key Laboratory of Blockchain and Data Security, Zhejiang University 2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security 3Paul G. Allen School of Computer Science & Engineering, University of Washington. Correspondence to: Zhongle Xie <EMAIL>, Lidan Shou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Efficient Sparse Kernel — 1: Input: hidden states x, threshold t_ij, E_ij = {W^gate_ij, W^down_ij, W^up_ij}; 2: v ← x·W^up_ij; 3: mask ← (\|v\| > t_ij); 4: x′ ← SiLU(x·W^gate_ij[mask]) ⊙ v[mask]; 5: y ← W^down_ij[mask]·x′; 6: Return: y |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Mixtral-8×7B model on the C4 dataset (Raffel et al., 2019); perplexity on WikiText-2 (Merity et al., 2016); randomly sample 100 sequences of length 256 from ShareGPT (ShareGPT, 2023); seven downstream tasks using the EleutherAI LM Harness (Gao et al., 2024) |
| Dataset Splits | No | The paper mentions using well-known datasets and specific evaluation setups like "zero-shot" and "5-shot MMLU", and for some tests it states "randomly sample 100 sequences of length 256 from ShareGPT". However, it does not provide specific train/test/validation split percentages, absolute sample counts, or explicit references to predefined splits for general model training or evaluation that would be needed for reproduction. |
| Hardware Specification | Yes | FloE achieves a 48.7× inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090. We use a GeForce RTX 3090 with 24G VRAM to evaluate end-to-end latency on ShareGPT (ShareGPT, 2023) prompts. The system is also equipped with a 64-core CPU at 2.3GHz and 256G DRAM interconnected via PCIe 4.0. For the single-expert latency test, we use the C4 dataset (Raffel et al., 2019) and employ four types of GPUs, including H100, A100, A6000, and GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions "Triton (Tillet et al., 2019)" and "PyTorch (Paszke et al., 2019)" but does not provide specific version numbers for these or any other software dependencies used in their experiments. |
| Experiment Setup | Yes | Empirically, FloE achieves a 9.3× compression of parameters per expert in Mixtral-8×7B. We ran 80 warm-up iterations and 200 timed trials to measure execution latency. Our kernel consistently outperforms the dense baseline (sparsity = 0). We select the average output tokens per second (TPS) as the measurement, using 500 tokens from the C4 dataset. Optimal chunk size in our setup is 50. Specifically, contextual activation sparsity S(·) is applied to the gate projection W^gate_ij and down projection W^down_ij to produce W^S(gate)_ij and W^S(down)_ij. Meanwhile, ultra-low-bit quantization Q(·) (INT2) is used for the up projection W^up_ij to yield W^Q(up)_ij. We set thresholds for the outputs of the SiLU activation function, W^up, and the inputs to W^down at various sparsity levels, then measured text perplexity on WikiText-2 (Merity et al., 2016). |
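The sparse-kernel pseudocode quoted in the Pseudocode row above can be sketched in plain NumPy. This is a hedged illustration of the control flow only, not the paper's Triton kernel: the function name, argument names, and shapes are assumptions. The idea is to compute the up projection first, threshold its magnitude to find active intermediate channels, and then evaluate the gate and down projections only on those channels.

```python
import numpy as np

def silu(z):
    # SiLU activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def sparse_expert_forward(x, w_gate, w_up, w_down, threshold):
    """Sketch of Algorithm 1 for a single token x of shape (d,).

    w_gate, w_up: (d, d_ff); w_down: (d_ff, d).
    Channels whose up-projection magnitude is <= threshold are skipped,
    so the gate and down projections touch only the active columns/rows.
    """
    v = x @ w_up                         # step 2: up projection
    mask = np.abs(v) > threshold         # step 3: active channels
    h = silu(x @ w_gate[:, mask]) * v[mask]  # step 4: gated activation on active channels
    return h @ w_down[mask, :]           # step 5: down projection over active rows
```

Because the skipped channels are exactly those whose up-projection output is zeroed, the result matches a dense expert forward pass with the sub-threshold entries of v set to zero; the saving comes from never materializing the inactive columns of W^gate and rows of W^down.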