Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Authors: Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart Van Baalen, Markus Nagel, Paul N. Whatmough, Babak Ehteshami Bejnordi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating on language modeling, MMLU, and GSM8K benchmarks, our method reduces cache miss rates by over 50%, with negligible impact on perplexity (0.1%–3%) and downstream task accuracy (<0.1%). Finally, we present on-device results demonstrating 2× speedups on mobile hardware, offering a flexible and training-free solution to extend MoE's applicability across real-world applications. We conduct experiments on three tasks: language modeling (using the WikiText-2-raw-v1 dataset), MMLU, and GSM8K. For WikiText, we report perplexity and cache miss rate, while for MMLU and GSM8K, we report accuracy and cache miss rate. |
| Researcher Affiliation | Industry | Andrii Skliar (EMAIL, Contextual AI); Ties van Rozendaal (EMAIL, Qualcomm AI Research); Romain Lepert (EMAIL, Qualcomm AI Research); Todor Boinovski (EMAIL, Qualcomm AI Research); Mart van Baalen (EMAIL, Qualcomm AI Research); Markus Nagel (EMAIL, Qualcomm AI Research); Paul Whatmough (EMAIL, Qualcomm AI Research); Babak Ehteshami Bejnordi (EMAIL, Qualcomm AI Research) |
| Pseudocode | Yes | Finally, the pseudocode for the max-rank algorithm that always keeps the top-J experts is shown in Algorithm 1. Additionally, we provide an intuitive explanation of the algorithm with an example in Appendix B. ... The pseudocode for the cumulative probability threshold approach can be found in Algorithm 2. |
| Open Source Code | No | Using llama-cpp (Gerganov, 2023) for CPU-based deployment, we modified the implementation to add both LRU caching for experts and our Cache-Prior algorithm. The paper does not explicitly state that their modified code or the code for their specific method is publicly available. |
| Open Datasets | Yes | We conduct experiments on three tasks: language modeling (using the WikiText-2-raw-v1 dataset), MMLU, and GSM8K. |
| Dataset Splits | No | For WikiText, we report perplexity and cache miss rate, while for MMLU and GSM8K, we report accuracy and cache miss rate. The MMLU dataset consists of multiple-choice questions across 57 subjects, and GSM8K evaluates multi-step reasoning for math problems. ... For dataset preprocessing, we concatenate the WikiText text into a single blob, split by "\n\n", and chunk it into context lengths of 1024. For MMLU and GSM8K, we apply a few-shot approach (5 shots for MMLU and 8 shots with chain-of-thought for GSM8K). The paper describes evaluation methodologies and data processing, but does not specify explicit training, validation, and test splits with percentages or sample counts for reproducibility. |
| Hardware Specification | Yes | To evaluate the effectiveness of our cache-aware routing technique in real-world scenarios, we deployed the Qwen1.5-MoE-A2.7B model with our cache-aware routing on two mobile devices (12GB and 16GB RAM) equipped with Qualcomm Snapdragon processors running Android 14. |
| Software Dependencies | No | Using llama-cpp (Gerganov, 2023) for CPU-based deployment, we modified the implementation to add both LRU caching for experts and our Cache-Prior algorithm. Additionally, we enabled memory locking (mlock) to prevent the Android OS from offloading expert weights from memory. The paper mentions 'llama-cpp' and 'Android 14' but does not provide specific version numbers for software libraries or dependencies essential for reproducing the experiment, beyond the year of the llama-cpp reference. |
| Experiment Setup | Yes | Each cache-aware routing strategy has a hyperparameter to balance cache miss rate and task performance. We use the following values to generate Pareto curves: Pruning and Max-Rank use 0, 1, ..., K, while Cumulative Sum Thresholding and Cache-Prior use 50 equidistant points in [0, 1]. For guaranteed top-J loading, we set J = 1 for Mixtral and Phi-MoE models and J = 2 for the granular Qwen-MoE and DeepSeek-MoE architectures. ... For all experiments, the cache miss rate is computed using the Least Recently Used (LRU) eviction policy, unless stated otherwise in ablations. |
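The paper's headline metric, cache miss rate under an LRU eviction policy, is straightforward to reproduce independently of the model. The sketch below is not the authors' llama-cpp modification; it is a minimal, hypothetical simulator in which `requests` stands in for the sequence of expert IDs the router selects and `cache_size` for how many experts fit in device memory.

```python
from collections import OrderedDict

def lru_miss_rate(requests, cache_size):
    """Simulate an LRU expert cache and return the cache miss rate.

    requests: iterable of expert ids in the order the router selects them
              (hypothetical stand-in for a real routing trace).
    cache_size: number of experts that fit in device memory.
    """
    cache = OrderedDict()  # keys are cached expert ids; order tracks recency
    misses = 0
    for expert in requests:
        if expert in cache:
            cache.move_to_end(expert)      # hit: mark as most recently used
        else:
            misses += 1                    # miss: expert weights must be loaded
            cache[expert] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used expert
    return misses / len(requests)
```

Feeding the same routing trace through this function with and without a cache-aware routing strategy gives the kind of before/after miss-rate comparison the paper reports.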
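The cumulative-probability-threshold approach (the paper's Algorithm 2) combined with guaranteed top-J loading can be illustrated with a short sketch. This is an assumption-laden reconstruction from the quoted description, not the paper's pseudocode: `router_probs`, `threshold`, and `top_j` are illustrative names, where `threshold` is the hyperparameter swept over 50 equidistant points in [0, 1] and `top_j` is J = 1 or J = 2 depending on the architecture.

```python
import numpy as np

def select_experts(router_probs, threshold, top_j):
    """Select experts whose cumulative probability covers `threshold`,
    always keeping at least the top-J experts.

    router_probs: 1-D array of per-expert routing probabilities (sums to 1).
    threshold: cumulative probability mass to cover, in [0, 1].
    top_j: minimum number of highest-probability experts to load.
    """
    order = np.argsort(router_probs)[::-1]        # experts by descending probability
    cum = np.cumsum(router_probs[order])
    # smallest prefix of experts whose probability mass reaches the threshold
    k = int(np.searchsorted(cum, threshold)) + 1
    k = max(k, top_j)                             # guarantee top-J loading
    return order[:k].tolist()
```

A lower threshold selects fewer experts (fewer potential cache misses) at the cost of task performance, which is how the Pareto curves in the paper trade the two off.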