TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Authors: Jack Min Ong, Matthew Di Ferrante, Aaron Pazdera, Ryan Garner, Sami Jaghouar, Manveer Basra, Max Ryabinin, Johannes Hagemann
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC reduces the memory overhead of the generated proofs by roughly 1000×, requiring only 258 bytes of storage per 32 new tokens, compared to the 262 KB required to store the token embeddings directly for Llama 3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems and laying a foundation for decentralized, verifiable, and trustless AI services. |
| Researcher Affiliation | Industry | 1Prime Intellect 2Together AI. |
| Pseudocode | Yes | Algorithm 1 TOPLOC Proof Generation Algorithm Algorithm 2 TOPLOC Proof Validation Algorithm |
| Open Source Code | Yes | The corresponding implementations are also available on GitHub: github.com/PrimeIntellect-ai/toploc |
| Open Datasets | Yes | For our experiments, we use the UltraChat dataset (Ding et al., 2023). The UltraChat dataset contains 1.4 million dialogues consisting of real-world inquiries, creative writing prompts, and various other text-based tasks such as rewriting, continuation, summarization, and inference, covering a wide range of topics. |
| Dataset Splits | No | The paper mentions using the UltraChat dataset and running 2000 queries or 400 prompts. However, it does not specify any explicit training, validation, or test splits, nor does it provide percentages or sample counts for these splits. The dataset is used for generation and validation tasks without detailed partitioning information for reproducibility. |
| Hardware Specification | Yes | We report the worst-case error statistics for different tensor parallelism and GPU combinations in Table 2. Here, none of the error statistics exceed the thresholds proposed in Section 5.2. Table 2 (error statistics for validation with different tensor parallelism configurations, GPUs, and attention kernel implementations), generation on 1x A100: validation on 1x A100 gives max top-k mismatch 10 (7.81%), max exponent mismatch 16 (12.50%), max mantissa diff mean 5.06, median 2; validation on 1x 4090 gives 10 (7.81%), 18 (14.06%), 4.68, 2; validation on 2x 4090 gives 15 (11.72%), 19 (14.84%), 4.96, 4. |
| Software Dependencies | No | The paper mentions several software components like vLLM, Hugging Face Transformers, Flash Attention 2, PyTorch Scaled Dot Product Attention, Flex Attention, and CUDA. However, it does not provide specific version numbers for any of these components, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | For thresholds, we use Texp = 38, Tmean = 10 and Tmedian = 8 for bf16 inference and Texp = 8, Tmean = 256 and Tmedian = 128 for fp32 inference. These thresholds were chosen based on our analysis of the error statistics in Table 2 and Table 5. We use the bf16 precision for all our experiments, unless specified otherwise. |
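The ~1000× storage claim in the abstract row follows directly from the quoted numbers. A minimal sketch of the arithmetic, assuming Llama 3.1-8B's hidden size of 4096 and 2 bytes per bf16 activation (the 258-byte proof size is taken from the paper; the constants are otherwise assumptions):

```python
# Storage comparison behind the ~1000x compression claim quoted above.
HIDDEN_SIZE = 4096        # Llama 3.1-8B-Instruct hidden dimension (assumed)
BYTES_PER_BF16 = 2        # bf16 is 2 bytes per value
TOKENS_PER_PROOF = 32     # TOPLOC commits to 32 new tokens per proof
PROOF_BYTES = 258         # proof size quoted in the paper

# Storing raw final-layer embeddings for 32 tokens:
raw_bytes = TOKENS_PER_PROOF * HIDDEN_SIZE * BYTES_PER_BF16  # 262144 B

ratio = raw_bytes / PROOF_BYTES
print(f"raw embeddings: {raw_bytes} bytes (~{raw_bytes / 1000:.0f} KB)")
print(f"proof: {PROOF_BYTES} bytes, compression ~{ratio:.0f}x")
```

This reproduces the 262 KB figure quoted for the raw embeddings and a compression factor slightly above 1000×.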
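The setup row gives the acceptance thresholds, and the hardware row gives worst-case observed error statistics. A hedged sketch of how validation could compare the two (function and argument names are illustrative assumptions, not the paper's API; the threshold values are the paper's):

```python
# Thresholds quoted in the Experiment Setup row.
BF16_THRESHOLDS = {"exp": 38, "mean": 10, "median": 8}
FP32_THRESHOLDS = {"exp": 8, "mean": 256, "median": 128}

def passes_validation(exp_mismatch, mantissa_mean, mantissa_median,
                      thresholds=BF16_THRESHOLDS):
    """Accept only if every error statistic is within its threshold.

    Hypothetical helper: the paper's validation compares exponent-mismatch
    counts and mean/median mantissa differences against fixed thresholds.
    """
    return (exp_mismatch <= thresholds["exp"]
            and mantissa_mean <= thresholds["mean"]
            and mantissa_median <= thresholds["median"])

# Worst case from the cross-GPU statistics (2x 4090 validation):
# 19 exponent mismatches, mean mantissa diff 4.96, median 4,
# comfortably within the bf16 thresholds, so validation accepts.
print(passes_validation(19, 4.96, 4))
```

This makes concrete the claim in the hardware row that none of the observed error statistics exceed the Section 5.2 thresholds.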