Fast Inference with Kronecker-Sparse Matrices
Authors: Antoine Gonon, Léon Zheng, Pascal Carrivain, Tung Quoc Le
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across 600 KS patterns, our kernel achieves in FP32 a median speedup of 1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 Medium. |
| Researcher Affiliation | Collaboration | 1ENS de Lyon, CNRS, Inria, Université Claude Bernard Lyon 1, LIP, UMR 5668, 69342, Lyon cedex 07, France 2Institute of Mathematics, EPFL, Lausanne, Switzerland 3valeo.ai, Paris, France 4Huawei Lagrange Mathematics and Computing Research Center, Paris, France 5Toulouse School of Economics, Toulouse, France. |
| Pseudocode | Yes | Algorithm 1 Permutation-based KS matmul ... Algorithm 2 New mathematically equivalent tiling of Algorithm 1 (no global memory permutations), see Figure 5. ... Algorithm 3 Sketch of the fused output-stationary kernel (one tile (row_{i,j}, col_{i,j}) assigned to each thread block). |
| Open Source Code | Yes | We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer |
| Open Datasets | No | The paper uses models like ViT-S/16 and GPT-2 Medium, which are typically evaluated on standard datasets like ImageNet or common text corpora. However, it does not explicitly state that the datasets themselves are publicly available, nor does it provide concrete access information (links, DOIs, citations for the datasets) for the data used in its experiments. |
| Dataset Splits | No | The paper focuses on accelerating inference for pre-trained models (ViT-S/16 and GPT-2 Medium) and does not describe training or evaluation on specific dataset splits. Information like batch size (e.g., 'B = 128 × 196 = 25 088') is provided for inference configurations, but no details regarding training, validation, or test splits of any dataset are given, as the paper's scope is inference optimization. |
| Hardware Specification | Yes | Benchmarking Time Execution. All the experiments measuring time execution of a Kronecker-sparse matrix multiplication algorithm... are performed on a NVIDIA A100-PCIE-40GB GPU associated with an Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz with 377G of memory. ... Benchmarking Energy Consumption. Measurements of the energy consumption... are done on a NVIDIA Tesla V100-PCIE-16GB GPU associated with an Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz with 754G of memory. |
| Software Dependencies | Yes | The PyTorch package version is 2.2 and pytorch-cuda is 12.1. |
| Experiment Setup | Yes | Benchmarked KS Patterns. We explore patterns of the form π = (a, b, c, d), sweeping over the space α × β × β × α, with α = {1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128} and β = {48, 64, 96, 128, 192, 256, 384, 512, 768, 1024}. We constrain the shapes by enforcing b = c, b = 4c, or c = 4b... The input batch size is fixed to B = 128 × 196 = 25 088... The nonzero entries of any Kronecker-sparse matrix K ∈ R^{abd × acd} with sparsity pattern (a, b, c, d) are drawn i.i.d. uniformly in [−1/c, 1/c]... The entries of the inputs X are drawn i.i.d. according to a standard normal distribution N(0, 1). |
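The experiment setup above can be sketched as follows. This is a minimal NumPy illustration, not the paper's CUDA kernel: a KS pattern (a, b, c, d) has support I_a ⊗ 1_{b×c} ⊗ I_d of shape (abd, acd), with nonzeros drawn uniformly in [−1/c, 1/c] and inputs drawn from N(0, 1) as quoted. The helper names `ks_support` and `random_ks_matrix` are hypothetical, chosen for this sketch.

```python
import numpy as np

def ks_support(a, b, c, d):
    # Support of the Kronecker-sparse pattern (a, b, c, d):
    # I_a ⊗ 1_{b×c} ⊗ I_d, a binary matrix of shape (a*b*d, a*c*d).
    return np.kron(np.eye(a), np.kron(np.ones((b, c)), np.eye(d)))

def random_ks_matrix(a, b, c, d, rng):
    # Nonzero entries drawn i.i.d. uniformly in [-1/c, 1/c],
    # matching the initialization described in the setup above.
    support = ks_support(a, b, c, d)
    values = rng.uniform(-1.0 / c, 1.0 / c, size=support.shape)
    return support * values

rng = np.random.default_rng(0)
a, b, c, d = 2, 4, 3, 2          # a small illustrative pattern
K = random_ks_matrix(a, b, c, d, rng)
B = 8                            # tiny batch; the paper uses B = 25 088
X = rng.standard_normal((a * c * d, B))  # inputs ~ N(0, 1)
Y = K @ X  # dense reference multiply; the paper's kernel exploits the sparsity
print(K.shape, np.count_nonzero(K))  # shape (a*b*d, a*c*d), a*b*c*d nonzeros
```

Note that K has only a·b·c·d nonzeros out of (abd)·(acd) entries, which is the structured sparsity the paper's fused kernel exploits instead of materializing K densely as done here.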