NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
Authors: Jaden Fiotto-Kaufman, Alexander Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla Brodley, Arjun Guha, Jonathan Bell, Byron Wallace, David Bau
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. |
| Researcher Affiliation | Academia | 1Northeastern University, 2TU Clausthal, 3University of Hamburg |
| Pseudocode | Yes | Code Example 1: Basic usage of the NNsight tracing context and the NNsight data type. Code Example 2: An intervention implemented with standard PyTorch hooks. Code Example 3: An intervention implemented with the NNsight API. Code Example 4: An NNsight implementation for attribution patching (Kramár et al., 2024). Code Example 5: Using NNsight to train a LoRA with remote execution, the Session context, and iterative interventions. Code Example 6: Applying backpropagation and accessing gradients with respect to a loss. Code Example 7: Zeroing out the grad of one layer, and doubling the grad of a second layer. Code Example 8: Sample code showing the remote training of a linear probe using NDIF. Code Example 9 shows how we generate sample prompts and requests for a single user, saving the response time of the request sent to and returned by NDIF. |
| Open Source Code | Yes | Code, documentation, and tutorials are available at https://nnsight.net/. The code for creating NDIF infrastructure is freely available on GitHub at https://github.com/ndif-team/ndif, allowing users to create their own NDIF infrastructure for specialized use cases. To promote transparency and ensure the reproducibility of our results, the complete NNsight and NDIF frameworks, along with comprehensive documentation and all relevant experiment code, are available at https://github.com/ndif-team/nnsight and https://github.com/ndif-team/ndif, respectively. |
| Open Datasets | Yes | For this purpose, we use a single batch of 32 examples from the Indirect Object Identification (IOI) dataset (Wang et al., 2022). |
| Dataset Splits | No | The paper mentions using a "single batch of 32 examples from the Indirect Object Identification (IOI) dataset" and that "the sample size is n = 128" in a figure caption. Code examples also show generated datasets like `dataset = [["_", answer_token]] * 100`. However, it does not provide explicit training/test/validation splits for any dataset used in its experiments, nor does it refer to standard predefined splits with sufficient detail. |
| Hardware Specification | Yes | The experiments were conducted on a single HPC node with four NVIDIA H100 PCIe 82 GB GPUs with CUDA version 12.3 and an Intel(R) Xeon(R) Gold 6342 CPU with 24 cores. To ensure a fair comparison, we deployed private instances of both Petals and NDIF on a server with a single NVIDIA RTX A6000 49 GB GPU with CUDA version 12.5 and an AMD Ryzen 9 5900X CPU with 12 cores. We use Llama-3.1-8B (Dubey et al., 2024) as the model being served on NDIF for this evaluation, which was hosted on a 48 GB RTX 6000 Ada GPU. |
| Software Dependencies | Yes | The experiments were conducted on a single HPC node with four NVIDIA H100 PCIe 82 GB GPUs with CUDA version 12.3 and an Intel(R) Xeon(R) Gold 6342 CPU with 24 cores. We deployed private instances of both Petals and NDIF on a server with a single NVIDIA RTX A6000 49 GB GPU with CUDA version 12.5 and an AMD Ryzen 9 5900X CPU with 12 cores. |
| Experiment Setup | Yes | Our evaluation focuses on the time required to load the model weights into memory, as well as the runtime of activation patching, a standard model intervention technique (Vig et al., 2020). For this purpose, we use a single batch of 32 examples from the Indirect Object Identification (IOI) dataset (Wang et al., 2022). Code Example 5 shows `optimizer = torch.optim.AdamW(lora.parameters(), lr=3)`, `dataset = [["_", answer_token]] * 100`, and `dataloader = DataLoader(dataset, batch_size=10)`. Code Example 8 shows `probe = torch.nn.Linear(features, features).save()`, `optimizer = torch.optim.Adam(probe.parameters(), lr=0.003)`, `for epoch in list(range(50)):`, and `DataLoader(["some", "text", "to", "train", "on"], batch_size=2, shuffle=True)`. |
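The "tracing context" pattern described in the pseudocode entries above (Code Examples 1–3) can be illustrated with a minimal, dependency-free sketch. This is a conceptual toy, not the real NNsight API: `ToyModel`, `Trace`, and the layer names are hypothetical stand-ins, and the real library operates on PyTorch modules via hooks rather than plain functions.

```python
# Toy illustration of deferred interventions: a tracing context records
# what the user wants to read or modify, then replays those interventions
# during a single forward pass. All names here are illustrative.

class ToyModel:
    """A stand-in 'model': a pipeline of named layers (plain functions)."""
    def __init__(self):
        self.layers = [("embed", lambda x: x * 2),
                       ("mlp",   lambda x: x + 3),
                       ("head",  lambda x: x * 10)]

    def forward(self, x, hooks=None):
        hooks = hooks or {}
        for name, fn in self.layers:
            x = fn(x)
            if name in hooks:          # apply any registered intervention
                x = hooks[name](x)
        return x

class Trace:
    """Context manager that collects interventions, then runs the model once."""
    def __init__(self, model, inp):
        self.model, self.inp = model, inp
        self.hooks, self.saved, self.output = {}, {}, None

    def save(self, layer):
        # Defer: record this layer's activation during the eventual run.
        def hook(act, _layer=layer):
            self.saved[_layer] = act
            return act
        self.hooks[layer] = hook

    def edit(self, layer, fn):
        # Defer: replace this layer's activation during the eventual run.
        self.hooks[layer] = fn

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # The forward pass happens only once, when the context closes.
        self.output = self.model.forward(self.inp, self.hooks)

model = ToyModel()
with Trace(model, 1) as tr:
    tr.save("mlp")                  # read an intermediate activation
# embed: 1*2 = 2; mlp: 2+3 = 5 (saved); head: 5*10 = 50
print(tr.saved["mlp"], tr.output)   # → 5 50
```

The design point this sketches is the one the review's pseudocode captions contrast with raw PyTorch hooks: the user states interventions declaratively inside the context, and execution is deferred to a single run, which is what makes remote execution on NDIF possible.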
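The probe-training loop quoted from Code Example 8 can likewise be sketched without the torch/NDIF dependencies, as plain gradient descent on a one-feature linear probe. The data, learning rate, and variable names below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of linear-probe training: minimize mean-squared error of
# w*x + b against targets by hand-computed gradient descent. In the paper's
# Code Example 8 this is torch.nn.Linear plus torch.optim.Adam instead.

acts   = [0.0, 1.0, 2.0, 3.0]       # stand-in activations (one feature)
labels = [1.0, 3.0, 5.0, 7.0]       # targets generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for epoch in range(500):             # mirrors the quoted epoch loop
    gw = gb = 0.0
    for x, y in zip(acts, labels):
        err = (w * x + b) - y        # prediction error on one example
        gw += 2 * err * x / len(acts)    # d(MSE)/dw
        gb += 2 * err / len(acts)        # d(MSE)/db
    w -= lr * gw                     # gradient step on the probe weights
    b -= lr * gb

print(round(w, 2), round(b, 2))      # converges toward w ≈ 2, b ≈ 1
```

With a real model, `acts` would be hidden activations saved through the tracing context and the loop would run remotely on NDIF; the structure of the loop is the same.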