LAST SToP for Modeling Asynchronous Time Series

Authors: Shubham Gupta, Thibaut Durand, Graham W. Taylor, Lilian Bialokozowicz

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets." "We conduct comprehensive evaluations on real-world datasets across multiple tasks to demonstrate the effectiveness of our proposed method."
Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute, Canada; 2Université Laval, Canada; 3RBC Borealis; 4Vector Institute, Canada; 5Electronic Arts. Correspondence to: Shubham Gupta <EMAIL>, Thibaut Durand <EMAIL>, Lilian W. Bialokozowicz <EMAIL>.
Pseudocode | No | The paper describes methods in textual form and through diagrams (e.g., Figures 1 and 3) but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Code available at https://github.com/BorealisAI/last-stop."
Open Datasets | Yes | "Datasets. We perform experiments on two different sets of datasets: three text-based action datasets and five standard temporal point process datasets... Breakfast (Kuehne et al., 2014)... EPIC-KITCHENS-100 (Damen et al., 2022)... MultiTHUMOS (Yeung et al., 2018)... Amazon (Ni et al., 2019)... Retweet (Zhou et al., 2013)... Taxi (Whong, 2014)... Taobao (Xue et al., 2022)... Stack Overflow (https://snap.stanford.edu/data/), where the goal is to predict the timestamp and category (among 22 categories) of the next badge assigned to a given user."
Dataset Splits | Yes | "Following (Xue et al., 2024), we split our datasets into a train/validation/test ratio of 70/10/20."
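The 70/10/20 split quoted above can be sketched as follows. This is a minimal illustration, not the paper's code; the shuffling, seeding, and the stand-in data are assumptions.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle items deterministically, then split 70/10/20
    into train/validation/test, as in the quoted setup."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Stand-in for a dataset of 1000 sequences.
train, val, test = split_dataset(range(1000))
# len(train), len(val), len(test) -> 700, 100, 200
```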
Hardware Specification | No | The paper mentions using "compute resources available to us" but does not specify any particular hardware such as GPU models, CPU types, or memory.
Software Dependencies | No | "We use Llama-3-8B-Instruct (Dubey et al., 2024) as our LLM backbone." While a specific LLM is mentioned, no version numbers for underlying software libraries (e.g., Python, PyTorch, CUDA) are provided.
Experiment Setup | Yes | "For LLM adaptation experiments, we use QLoRA as the low-rank adaptation algorithm, Adam as the optimizer, and a constant learning rate of 2e-4 for QLoRA and 1e-4 for prompt tuning. Following (Xue et al., 2024), we split our datasets into a train/validation/test ratio of 70/10/20. Both SP and SToP training are conducted for the same number of epochs. We employ early stopping based on the Macro-F1 on the validation set. We report performance on the test set. We use a prompt length of 400 for prompt tuning in both SP and SToP experiments... For QLoRA, we use a rank of 4, resulting in a comparable number of trainable parameters (1.7M)."
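The rank-4 low-rank adaptation described above can be illustrated with a plain-PyTorch sketch: a frozen linear layer plus a trainable rank-4 update, with only the low-rank factors passed to Adam at the quoted learning rate of 2e-4. This is a simplified stand-in for QLoRA (no quantization), and all class and variable names here are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update
    (x @ A @ B), mirroring the rank-4 adaptation in the quoted setup."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # Freeze the base weights; only the LoRA factors are trained.
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a) @ self.lora_b

layer = LoRALinear(64, 64, rank=4)
trainable = [p for p in layer.parameters() if p.requires_grad]
# Adam with the constant learning rate of 2e-4 quoted above.
optimizer = torch.optim.Adam(trainable, lr=2e-4)
```

Initializing `lora_b` to zeros makes the adapted layer start out identical to the frozen base layer, a common LoRA design choice.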