CipherPrune: Efficient and Scalable Private Transformer Inference

Authors: Yancheng Zhang, Jiaqi Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately 6.1× for 128-token inputs and 10.6× for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference. (Section 4: Experiments)
Researcher Affiliation | Collaboration | Yancheng Zhang1, Jiaqi Xue1, Mengxin Zheng1, Mimi Xie2, Mingzhe Zhang3, Lei Jiang4, Qian Lou1* — 1University of Central Florida; 2University of Texas at San Antonio; 3Ant Research; 4Indiana University Bloomington
Pseudocode | Yes | Algorithm 1: Crypto-aware Thresholds Learning
Open Source Code | Yes | The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
Open Datasets | Yes | Similar to prior work (Pang et al., 2024), we fine-tune the BERT models on four downstream NLP tasks in GLUE benchmarks (Wang et al., 2018): the Multi-Genre Natural Language Inference Corpus (MNLI), the Stanford Question Answering Dataset (QNLI), the Stanford Sentiment Treebank (SST-2), and the Microsoft Research Paraphrase Corpus (MRPC).
Dataset Splits | Yes | Similar to prior work (Pang et al., 2024), we fine-tune the BERT models on four downstream NLP tasks in GLUE benchmarks (Wang et al., 2018): the Multi-Genre Natural Language Inference Corpus (MNLI), the Stanford Question Answering Dataset (QNLI), the Stanford Sentiment Treebank (SST-2), and the Microsoft Research Paraphrase Corpus (MRPC).
Hardware Specification | Yes | All experiments are conducted on an AMD Ryzen Threadripper PRO 3955WX (2.2 GHz, 125 GB RAM), and fine-tuning of the BERT model with threshold learning is done on NVIDIA GeForce RTX 3090 GPUs with CUDA 11.0.3.
Software Dependencies | Yes | CipherPrune uses the EzPC (EzPC, 2023) framework and the SEAL (SEAL, 2023) library. EzPC compiles TensorFlow-based deep neural networks into secure computation protocols running on cryptographic backends... Fine-tuning of the BERT model with threshold learning is done on NVIDIA GeForce RTX 3090 GPUs with CUDA 11.0.3.
Experiment Setup | Yes | Algorithm 1: Crypto-aware Thresholds Learning. Input: pre-trained Transformer M, training data D, initial thresholds θ, β... The hyperparameters λ and α dictate the extent of pruning and approximation, with higher values leading to increased pruning or approximation. In Figure 12, we show the accuracy-latency trade-off for the BERT-Base model under different parameter settings. Larger λ and α result in more tokens being pruned or reduced. With λ less than 0.05, an appropriate ratio of tokens is pruned, maintaining a stable accuracy of around 90%. Smaller α leads to more tokens being computed with high-degree polynomials, which increases accuracy but also latency.
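To make the role of λ and α concrete, the sketch below shows one plausible form of a crypto-aware threshold-learning objective: a task loss regularized by a pruning penalty (scaled by λ) and a high-degree-polynomial approximation penalty (scaled by α), with a sigmoid relaxation so the thresholds remain trainable. This is a hypothetical illustration, not the authors' implementation; the function names, the soft-mask temperature, and the exact penalty forms are assumptions.

```python
import numpy as np

def soft_prune_mask(importance, theta, temp=10.0):
    """Differentiable keep-mask: tokens whose importance score exceeds the
    learned threshold theta are (softly) kept. The sigmoid relaxation lets
    gradients flow into theta during fine-tuning (assumed design, not the
    paper's exact formulation)."""
    return 1.0 / (1.0 + np.exp(-temp * (importance - theta)))

def threshold_learning_loss(task_loss, keep_mask, high_degree_mask,
                            lambda_=0.05, alpha=0.01):
    """Assumed total objective: task loss plus a penalty on the fraction of
    tokens kept (scaled by lambda_, so larger lambda_ prunes more) and a
    penalty on the fraction of tokens computed with expensive high-degree
    polynomial approximations (scaled by alpha, so larger alpha pushes more
    tokens to cheap low-degree approximations)."""
    prune_penalty = keep_mask.mean()          # fraction of tokens kept
    approx_penalty = high_degree_mask.mean()  # fraction using high-degree polys
    return task_loss + lambda_ * prune_penalty + alpha * approx_penalty

# Toy usage: random per-token importance scores for a batch of 8 sequences
# of 128 tokens, a single shared threshold of 0.5.
rng = np.random.default_rng(0)
importance = rng.random((8, 128))
mask = soft_prune_mask(importance, theta=0.5)
high_degree = (mask > 0.9).astype(float)
loss = threshold_learning_loss(task_loss=1.0, keep_mask=mask,
                               high_degree_mask=high_degree)
```

Under this form, the trade-off reported above falls out directly: raising λ makes keeping tokens more costly (more pruning, lower latency, some accuracy loss), while lowering α makes high-degree polynomial evaluation cheaper in the objective (more accuracy, more latency).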