Fast Inference with Kronecker-Sparse Matrices
Authors: Antoine Gonon, Léon Zheng, Pascal Carrivain, Tung Quoc Le
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across 600 KS patterns, our kernel achieves in FP32 a median speedup of 1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 Medium. |
| Researcher Affiliation | Collaboration | 1ENS de Lyon, CNRS, Inria, Université Claude Bernard Lyon 1, LIP, UMR 5668, 69342, Lyon cedex 07, France 2Institute of Mathematics, EPFL, Lausanne, Switzerland 3valeo.ai, Paris, France 4Huawei Lagrange Mathematics and Computing Research Center, Paris, France 5Toulouse School of Economics, Toulouse, France. |
| Pseudocode | Yes | Algorithm 1 Permutation-based KS matmul ... Algorithm 2 New mathematically equivalent tiling of Algorithm 1 (no global memory permutations), see Figure 5. ... Algorithm 3 Sketch of the fused output-stationary kernel (one tile (row_{i,j}, col_{i,j}) assigned to each thread block). |
| Open Source Code | Yes | We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer |
| Open Datasets | No | The paper uses models like ViT-S/16 and GPT-2 Medium, which are typically evaluated on standard datasets like ImageNet or common text corpora. However, it does not explicitly state that the datasets themselves are publicly available, nor does it provide concrete access information (links, DOIs, citations for the datasets) for the data used in its experiments. |
| Dataset Splits | No | The paper focuses on accelerating inference for pre-trained models (ViT-S/16 and GPT-2 Medium) and does not describe training or evaluation on specific dataset splits. Information like batch size (e.g., 'B = 128 × 196 = 25 088') is provided for inference configurations, but no details regarding training, validation, or test splits of any dataset are given, as the paper's scope is inference optimization. |
| Hardware Specification | Yes | Benchmarking Time Execution. All the experiments measuring time execution of a Kronecker-sparse matrix multiplication algorithm... are performed on a NVIDIA A100-PCIE-40GB GPU associated with an Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz with 377G of memory. ... Benchmarking Energy Consumption. Measurements of the energy consumption... are done on a NVIDIA Tesla V100-PCIE-16GB GPU associated with an Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz with 754G of memory. |
| Software Dependencies | Yes | The PyTorch package version is 2.2 and pytorch-cuda is 12.1. |
| Experiment Setup | Yes | Benchmarked KS Patterns. We explore patterns of the form π = (a, b, c, d), sweeping over the space α × β × β × α, with α = {1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128} and β = {48, 64, 96, 128, 192, 256, 384, 512, 768, 1024}. We constrain the shapes by enforcing b = c, b = 4c, or c = 4b... The input batch size is fixed to B = 128 × 196 = 25 088... The nonzero entries of any Kronecker-sparse matrix K ∈ R^{abd × acd} with sparsity pattern (a, b, c, d) are drawn i.i.d. uniformly in [−1/c, 1/c]... The entries of the inputs X are drawn i.i.d. according to a standard normal distribution N(0, 1). |
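The experiment setup above can be sketched as follows. This is a minimal NumPy illustration, not the paper's CUDA kernel: a KS pattern (a, b, c, d) has support I_a ⊗ 1_{b×c} ⊗ I_d of shape (abd, acd), with nonzeros drawn uniformly in [−1/c, 1/c] and inputs drawn from N(0, 1) as quoted. The helper names `ks_support` and `random_ks_matrix` are hypothetical, chosen for this sketch.

```python
import numpy as np

def ks_support(a, b, c, d):
    # Support of the Kronecker-sparse pattern (a, b, c, d):
    # I_a ⊗ 1_{b×c} ⊗ I_d, a binary matrix of shape (a*b*d, a*c*d).
    return np.kron(np.eye(a), np.kron(np.ones((b, c)), np.eye(d)))

def random_ks_matrix(a, b, c, d, rng):
    # Nonzero entries drawn i.i.d. uniformly in [-1/c, 1/c],
    # matching the initialization described in the setup above.
    support = ks_support(a, b, c, d)
    values = rng.uniform(-1.0 / c, 1.0 / c, size=support.shape)
    return support * values

rng = np.random.default_rng(0)
a, b, c, d = 2, 4, 3, 2          # a small illustrative pattern
K = random_ks_matrix(a, b, c, d, rng)
B = 8                            # tiny batch; the paper uses B = 25 088
X = rng.standard_normal((a * c * d, B))  # inputs ~ N(0, 1)
Y = K @ X  # dense reference multiply; the paper's kernel exploits the sparsity
print(K.shape, np.count_nonzero(K))  # shape (a*b*d, a*c*d), a*b*c*d nonzeros
```

Note that K has only a·b·c·d nonzeros out of (abd)·(acd) entries, which is the structured sparsity the paper's fused kernel exploits instead of materializing K densely as done here.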