Going Beyond Static: Understanding Shifts with Time-Series Attribution
Authors: Jiashuo Liu, Nabeel Seedat, Peng Cui, Mihaela van der Schaar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies in real-world healthcare applications highlight how the TSSA framework enhances the understanding of time-series shifts, facilitating reliable model deployment and driving targeted improvements from both algorithmic and data-centric perspectives. |
| Researcher Affiliation | Academia | Jiashuo Liu, Nabeel Seedat, Peng Cui & Mihaela van der Schaar — Tsinghua University, University of Cambridge |
| Pseudocode | No | The paper describes methodologies and processes but does not include any explicitly labeled pseudocode or algorithm blocks. For example, it describes the TSSA framework's parts and the doubly robust estimator's formulation, but not in a structured pseudocode format. |
| Open Source Code | No | The paper does not provide any specific links to a code repository, an explicit statement of code release, or mention of code in supplementary materials. |
| Open Datasets | Yes | Through our experiments, we use the Medical Information Mart for Intensive Care (MIMIC) (Johnson et al., 2016) dataset. |
| Dataset Splits | Yes | We follow the standard design outlined by Jarrett et al., randomly splitting the patients in the MIMIC-III dataset into a training set (18,490 patients, P) and a test set (4,610 patients, Q), ensuring no patient overlap between the two sets. For the validation set, we use the same patients as in the training set but select different time segments for their time-series features, denoted as Pval. ... Specifically, for the training set P, we utilize the last 24-hour time segments for all time-series features, while for the test set Q, we select the first 24-hour time segments for all features. This setup allows us to assess whether the model can effectively withstand these temporal shifts and accurately identify patients at high risk of mortality in the early stage. We train a Transformer model fθ(·) on P, which comprises 12,574 patients, and validate it on an additional 5,547 patients. To control for other shifts, we use the same set of patients for both the validation and test sets Q; the only difference lies in the time segments used: the last 24 hours for validation and the first 24 hours for testing. ... We consider a realistic scenario in which a model trained on historical data (12,574 patients, first 24-hour time series, denoted as P) must be deployed for new patients and future time segments (an additional 5,547 patients, second 24-hour time series, denoted as Q). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running its experiments. It only mentions using a Transformer model and an attribution model without hardware context. |
| Software Dependencies | No | The paper mentions using a 'Transformer model' and 'XGBoost' for comparison, but does not provide specific version numbers for any software libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For the original model (under evaluation), we use a Transformer model (n_head: 4, n_layer: 3, hidden dim: 32); the learning rate is 1e-3, the total epoch number is 200, the batch size is 256, and early stopping is used during training (based on the last 10 epochs). For the attribution model: the architecture is shown in Figure 2, where we use a two-layer MLP with hidden size selected from {16, 32, 64, 128} for each part according to the validation results, with learning rate 1e-3 and batch size 64. |
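The paper does not release code, but the reported hyperparameters and the early-stopping rule ("according to last 10 epochs") can be captured in a short sketch. This is a minimal, hedged reconstruction: the config names are hypothetical, the values are those quoted above, and the exact stopping criterion is an assumption (stop when validation loss has not improved within the last 10 epochs).

```python
# Hypothetical config names; values are the ones reported in the paper.
transformer_cfg = {
    "n_head": 4,        # attention heads
    "n_layer": 3,       # Transformer layers
    "hidden_dim": 32,   # hidden dimension
    "lr": 1e-3,         # learning rate
    "epochs": 200,      # total epoch budget
    "batch_size": 256,
}
attribution_cfg = {
    "hidden_dim_grid": [16, 32, 64, 128],  # selected per part on validation
    "lr": 1e-3,
    "batch_size": 64,
}

def should_stop(val_losses, patience=10):
    """Assumed early-stopping rule: stop once the best validation loss
    is more than `patience` epochs in the past."""
    if len(val_losses) <= patience:
        return False
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```

For example, a run whose validation loss last improved at epoch 1 and then stayed flat for 10 more epochs would be stopped, while any run shorter than the patience window is always continued.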