PICASO: Permutation-Invariant Context Composition with State Space Models

Authors: Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest-performing baseline while enjoying on average a 5.4× speedup."
Researcher Affiliation | Collaboration | Tian Yu Liu (UCLA); Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto (AWS AI Labs)
Pseudocode | Yes | "A. ALGORITHMS: PICASO-S AND PICASO-R. We show in Algorithm 1 how PICASO-S is computed in polynomial time via a dynamic-programming approach based on Algorithm 2. In Algorithm 3, we also show how PICASO-R can be computed with linear time complexity. Time complexity is measured as the number of arithmetic operations required as a function of the number of context states."
Open Source Code | No | The paper references third-party tools and models (e.g., HuggingFace, Mamba-2, Sentence-Transformers) and their repositories, but does not provide links to, or statements about releasing, the authors' own implementation of PICASO.
Open Datasets | Yes | "We evaluate our method on two large-scale datasets, WikiText-2 (Merity et al., 2016) and MSMARCO (Nguyen et al., 2016)."
Dataset Splits | Yes | "We use the training splits as our fine-tuning data, and the testing/validation splits respectively for evaluation. To pre-process WikiText-2 for our use case, we split each passage in the dataset into two equal context segments... For WikiText, we select k ∈ {0, ..., 10} uniformly at random for each batch. For MSMARCO, we use all the available passages (both relevant and irrelevant) associated with each training example."
Hardware Specification | Yes | "We used the official benchmark with an A100 GPU for our timing experiments in Figure 1 to ensure the fairest comparison."
Software Dependencies | No | The paper mentions software such as the Hugging Face trainer and the Mamba-2 2.7B model, and uses CUDA graphs and flash attention, but does not provide version numbers for these dependencies (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | "For fine-tuning experiments using BPTC and BP2C, we base our implementation on the official Hugging Face trainer with default hyperparameters, and retrieve the k most relevant context segments for each query sample for composition. For WikiText, we select k ∈ {0, ..., 10} uniformly at random for each batch. For both datasets, we fine-tune for only 1 epoch."
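The pseudocode row above notes that PICASO-S is a permutation-invariant composition of context states, computed in polynomial time via dynamic programming. As a toy illustration only (not the paper's Algorithm 1), suppose each context chunk is summarized by a hypothetical scalar linear update x ↦ a·x + b, as in a linear state-space recurrence; then a brute-force permutation average makes the permutation-invariance concrete:

```python
from itertools import permutations

# Toy sketch, NOT the paper's algorithm: each context chunk i is assumed
# to act on a scalar running state x via a linear update x -> a_i*x + b_i.
def compose_ordered(chunks, x0=0.0):
    """Apply the chunks' state updates in the given order."""
    x = x0
    for a, b in chunks:
        x = a * x + b
    return x

def picaso_s_bruteforce(chunks, x0=0.0):
    """Average the composed state over all orderings of the chunks.

    The result is permutation-invariant by construction; the paper's
    contribution is computing this average in polynomial rather than
    factorial time via dynamic programming.
    """
    perms = list(permutations(chunks))
    return sum(compose_ordered(p, x0) for p in perms) / len(perms)
```

For example, with chunks `[(0.5, 1.0), (2.0, 3.0)]` the two orderings compose to 5.0 and 2.5, so the permutation-averaged state is 3.75 regardless of the input order. This brute force is factorial in the number of chunks, which is exactly the cost the paper's dynamic-programming formulation avoids.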