PICASO: Permutation-Invariant Context Composition with State Space Models

Authors: Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest-performing baseline while enjoying on average a 5.4× speedup."
Researcher Affiliation | Collaboration | Tian Yu Liu (UCLA); Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto (AWS AI Labs)
Pseudocode | Yes | "A. ALGORITHMS: PICASO-S AND PICASO-R. We show in Algorithm 1 how PICASO-S is computed in polynomial time via a dynamic-programming approach based on Algorithm 2. In Algorithm 3, we also show how PICASO-R can be computed with linear time complexity. Time complexity is measured as the number of arithmetic operations required as a function of the number of context states."
Open Source Code | No | The paper references third-party tools and models (e.g., HuggingFace, Mamba-2, Sentence-Transformers) and their repositories, but does not provide links to, or statements about releasing, the authors' own implementation of PICASO.
Open Datasets | Yes | "We evaluate our method on two large-scale datasets, WikiText-2 (Merity et al., 2016) and MSMARCO (Nguyen et al., 2016)."
Dataset Splits | Yes | "We use the training splits as our fine-tuning data, and the testing/validation splits respectively for evaluation. To pre-process WikiText-2 for our use case, we split each passage in the dataset into two equal context segments... For WikiText, we select k ∈ {0, ..., 10} uniformly at random for each batch. For MSMARCO, we use all the available passages (both relevant and irrelevant) associated with each training example."
Hardware Specification | Yes | "We used the official benchmark with an A100 GPU for our timing experiments in Figure 1 to ensure the fairest comparison."
Software Dependencies | No | The paper mentions software such as the Hugging Face trainer and the Mamba-2 2.7B model, and uses CUDA graphs and flash attention, but does not provide version numbers for these dependencies (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | "For fine-tuning experiments using BPTC and BP2C, we base our implementation on the official Hugging Face trainer with default hyperparameters, and retrieve the k most relevant context segments for each query sample for composition. For WikiText, we select k ∈ {0, ..., 10} uniformly at random for each batch. For both datasets, we fine-tune for only 1 epoch."
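The pseudocode row above notes that PICASO-S is a permutation-invariant composition of context states, computed in polynomial time via dynamic programming. As a toy illustration only (not the paper's Algorithm 1), suppose each context chunk is summarized by a hypothetical scalar linear update x ↦ a·x + b, as in a linear state-space recurrence; then a brute-force permutation average makes the permutation-invariance concrete:

```python
from itertools import permutations

# Toy sketch, NOT the paper's algorithm: each context chunk i is assumed
# to act on a scalar running state x via a linear update x -> a_i*x + b_i.
def compose_ordered(chunks, x0=0.0):
    """Apply the chunks' state updates in the given order."""
    x = x0
    for a, b in chunks:
        x = a * x + b
    return x

def picaso_s_bruteforce(chunks, x0=0.0):
    """Average the composed state over all orderings of the chunks.

    The result is permutation-invariant by construction; the paper's
    contribution is computing this average in polynomial rather than
    factorial time via dynamic programming.
    """
    perms = list(permutations(chunks))
    return sum(compose_ordered(p, x0) for p in perms) / len(perms)
```

For example, with chunks `[(0.5, 1.0), (2.0, 3.0)]` the two orderings compose to 5.0 and 2.5, so the permutation-averaged state is 3.75 regardless of the input order. This brute force is factorial in the number of chunks, which is exactly the cost the paper's dynamic-programming formulation avoids.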