TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Authors: Jack Min Ong, Matthew Di Ferrante, Aaron Pazdera, Ryan Garner, Sami Jaghouar, Manveer Basra, Max Ryabinin, Johannes Hagemann
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC reduces the memory overhead of the generated proofs by roughly 1000×, requiring only 258 bytes of storage per 32 new tokens, compared to the 262 KB required to store the token embeddings directly for Llama 3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems and laying a foundation for decentralized, verifiable, and trustless AI services. |
| Researcher Affiliation | Industry | 1Prime Intellect 2Together AI. |
| Pseudocode | Yes | Algorithm 1 TOPLOC Proof Generation Algorithm Algorithm 2 TOPLOC Proof Validation Algorithm |
| Open Source Code | Yes | The corresponding implementations are also available on GitHub: github.com/PrimeIntellect-ai/toploc |
| Open Datasets | Yes | For our experiments, we use the UltraChat dataset (Ding et al., 2023). The UltraChat dataset contains 1.4 million dialogues consisting of real-world inquiries, creative writing prompts, and various other text-based tasks such as rewriting, continuation, summarization, and inference, covering a wide range of topics. |
| Dataset Splits | No | The paper mentions using the UltraChat dataset and running 2000 queries or 400 prompts. However, it does not specify any explicit training, validation, or test splits, nor does it provide percentages or sample counts for these splits. The dataset is used for generation and validation tasks without detailed partitioning information for reproducibility. |
| Hardware Specification | Yes | We report the worst-case error statistics for different tensor parallelism and GPU combinations in Table 2. Here, none of the error statistics exceed the thresholds proposed in Section 5.2. Table 2 (error statistics for validation with different tensor parallelism configurations, GPUs, and attention kernel implementations), generation on 1x A100: validation on 1x A100 gives max top-k mismatch 10 (7.81%), max exponent mismatch 16 (12.50%), max mantissa diff mean 5.06, median 2; validation on 1x 4090 gives 10 (7.81%), 18 (14.06%), 4.68, 2; validation on 2x 4090 gives 15 (11.72%), 19 (14.84%), 4.96, 4. |
| Software Dependencies | No | The paper mentions several software components like vLLM, Hugging Face Transformers, Flash Attention 2, PyTorch Scaled Dot Product Attention, Flex Attention, and CUDA. However, it does not provide specific version numbers for any of these components, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | For thresholds, we use Texp = 38, Tmean = 10 and Tmedian = 8 for bf16 inference and Texp = 8, Tmean = 256 and Tmedian = 128 for fp32 inference. These thresholds were chosen based on our analysis of the error statistics in Table 2 and Table 5. We use the bf16 precision for all our experiments, unless specified otherwise. |
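The ~1000× storage claim in the abstract row follows directly from the quoted numbers. A minimal sketch of the arithmetic, assuming Llama 3.1-8B's hidden size of 4096 and 2 bytes per bf16 activation (the 258-byte proof size is taken from the paper; the constants are otherwise assumptions):

```python
# Storage comparison behind the ~1000x compression claim quoted above.
HIDDEN_SIZE = 4096        # Llama 3.1-8B-Instruct hidden dimension (assumed)
BYTES_PER_BF16 = 2        # bf16 is 2 bytes per value
TOKENS_PER_PROOF = 32     # TOPLOC commits to 32 new tokens per proof
PROOF_BYTES = 258         # proof size quoted in the paper

# Storing raw final-layer embeddings for 32 tokens:
raw_bytes = TOKENS_PER_PROOF * HIDDEN_SIZE * BYTES_PER_BF16  # 262144 B

ratio = raw_bytes / PROOF_BYTES
print(f"raw embeddings: {raw_bytes} bytes (~{raw_bytes / 1000:.0f} KB)")
print(f"proof: {PROOF_BYTES} bytes, compression ~{ratio:.0f}x")
```

This reproduces the 262 KB figure quoted for the raw embeddings and a compression factor slightly above 1000×.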
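The setup row gives the acceptance thresholds, and the hardware row gives worst-case observed error statistics. A hedged sketch of how validation could compare the two (function and argument names are illustrative assumptions, not the paper's API; the threshold values are the paper's):

```python
# Thresholds quoted in the Experiment Setup row.
BF16_THRESHOLDS = {"exp": 38, "mean": 10, "median": 8}
FP32_THRESHOLDS = {"exp": 8, "mean": 256, "median": 128}

def passes_validation(exp_mismatch, mantissa_mean, mantissa_median,
                      thresholds=BF16_THRESHOLDS):
    """Accept only if every error statistic is within its threshold.

    Hypothetical helper: the paper's validation compares exponent-mismatch
    counts and mean/median mantissa differences against fixed thresholds.
    """
    return (exp_mismatch <= thresholds["exp"]
            and mantissa_mean <= thresholds["mean"]
            and mantissa_median <= thresholds["median"])

# Worst case from the cross-GPU statistics (2x 4090 validation):
# 19 exponent mismatches, mean mantissa diff 4.96, median 4,
# comfortably within the bf16 thresholds, so validation accepts.
print(passes_validation(19, 4.96, 4))
```

This makes concrete the claim in the hardware row that none of the observed error statistics exceed the Section 5.2 thresholds.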