Permissive Information-Flow Analysis for Large Language Models

Authors: Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Béguelin

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "Our experimental results in an LLM agent setting show that our label propagator assigns a more permissive label over the baseline in more than 85% of the cases, which underscores the practicality of our approach. We demonstrate the effectiveness of our proposed label propagator by evaluating it on three datasets."
Researcher Affiliation | Collaboration | "Shoaib Ahmed Siddiqui (1), Radhika Gaonkar (2), Boris Köpf (2), David Krueger (3), Andrew Paverd (2), Ahmed Salem (2), Shruti Tople (2), Lukas Wutschitz (2), Menglin Xia (2), Santiago Zanella-Béguelin (2); (1) University of Cambridge, (2) Microsoft, (3) Mila"
Pseudocode | Yes | "Algorithm 1 describes this idea in pseudocode. We represent the powerset of possible labels as a directed acyclic graph (DAG) where nodes represent labels and edges represent the lattice order (see Figure 2 for an illustration). Starting from the root node L = ⊔_{c ∈ C} ℓ(c) that corresponds to the full context, we traverse the DAG depth-first to identify λ-similar labels. For each label L, the function minimal_labels() returns Λ, the set of λ-similar labels at or below L (i.e., at least as permissive as L)."
Open Source Code | No | "The paper does not provide an explicit statement about releasing code or a direct link to a code repository for the methodology described."
Open Datasets | Yes | "Starting from an existing fake news dataset (Pérez-Rosas et al., 2018), we create pairs of high- and low-integrity news articles that discuss similar topics to each other."
Dataset Splits | Yes | "For the introspection baseline, we use one-shot in-context learning (Wei et al., 2022) based on the first example in the dataset and use the remaining 63 questions for evaluation."
Hardware Specification | No | The paper mentions using the Llama-2 model family (7B and 70B variants) but does not specify the hardware (e.g., specific GPU models or CPUs) on which these models were run for the experiments.
Software Dependencies | No | The paper mentions using Llama-2 models and GPT-4 for data generation, and metrics such as ROUGE-L, but does not specify any software libraries, frameworks, or programming languages with their version numbers.
Experiment Setup | Yes | "We fix a threshold of λ = 0.2 for this experiment. We use the base model without instruction tuning in this case due to the use of a custom chat format. For the kNN-LM implementation, we follow Khandelwal et al. (2020) and use the model's penultimate-layer representation of the last token (conditioned on all preceding tokens in the document) as the context representation for kNN search. The kNN-LM prediction uses γ = 0.5."
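The Pseudocode row above summarizes Algorithm 1: the powerset of labels forms a DAG ordered by the lattice, the search starts from the join of all context labels, and a depth-first traversal identifies the λ-similar labels. A minimal sketch of that search, under two assumptions that are ours rather than the paper's: labels are modeled as frozensets of context items (a subset is one step more permissive), and λ-similarity is supplied by the caller as a hypothetical `is_similar` predicate.

```python
def minimal_labels(context_labels, is_similar):
    # Hypothetical sketch of the DFS described in Algorithm 1 (not the
    # authors' implementation). Labels are frozensets of context items;
    # a strict subset sits lower in the lattice, i.e. is more permissive.
    root = frozenset().union(*context_labels)  # join of all context labels
    minimal, seen = [], set()

    def dfs(label):
        if label in seen:
            return
        seen.add(label)
        # Children drop one context item each: one step more permissive.
        similar_children = [label - {x} for x in label
                            if is_similar(label - {x})]
        if not similar_children:
            # No more-permissive label is still lambda-similar, so this
            # label is one of the most permissive similar labels.
            minimal.append(label)
            return
        for child in similar_children:
            dfs(child)

    if is_similar(root):  # root corresponds to the full context
        dfs(root)
    return minimal
```

For instance, with context labels {a, b} and {c} and a toy predicate that accepts any label with at least two items, the search descends from the root {a, b, c} and returns the three two-element labels, since none of their children remain similar.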
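The Experiment Setup row refers to the kNN-LM of Khandelwal et al. (2020), which mixes a nearest-neighbor distribution over retrieved datastore values with the base LM's next-token distribution. A minimal sketch of that interpolation, assuming the standard softmax-over-negative-distances weighting; the function name and arguments are illustrative, and the retrieval step itself (the kNN search over penultimate-layer representations) is omitted.

```python
import numpy as np

def knn_lm_probs(p_lm, distances, neighbor_ids, vocab_size, gamma=0.5):
    # Sketch of kNN-LM interpolation (Khandelwal et al., 2020).
    # p_lm: base LM next-token distribution, shape (vocab_size,).
    # distances: distances from the context representation to the k
    #   retrieved datastore keys.
    # neighbor_ids: token ids stored as values for those keys.
    weights = np.exp(-np.asarray(distances, dtype=float))
    weights /= weights.sum()          # softmax over negative distances
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, neighbor_ids):
        p_knn[tok] += w               # aggregate weight per token id
    # Interpolate: gamma * p_kNN + (1 - gamma) * p_LM.
    return gamma * p_knn + (1.0 - gamma) * np.asarray(p_lm, dtype=float)
```

With γ = 0.5, as in the quoted setup, the retrieved-neighbor distribution and the base LM distribution contribute equally to the final prediction.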