WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series
Authors: Jean-Christophe Gagnon-Audet, Kartik Ahuja, Mohammad Javad Darvishi Bayazi, Pooneh Mousavi, Guillaume Dumas, Irina Rish
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We revise the existing OOD generalization algorithms for time series tasks and evaluate them using our systematic framework. We conduct extensive experiments on the above datasets with ERM and various OOD generalization algorithms. Our findings lead us to conclude that OOD generalization in time series brings its own set of challenges and that there is a large room for improvement, as shown in Table 1. |
| Researcher Affiliation | Academia | Jean-Christophe Gagnon-Audet (Mila Québec AI Institute, University of Montreal); Kartik Ahuja (Mila Québec AI Institute, University of Montreal); Mohammad-Javad Darvishi-Bayazi (Mila Québec AI Institute, University of Montreal); Pooneh Mousavi (Gina Cody School of Engineering and Computer Science, Concordia University); Guillaume Dumas (Mila Québec AI Institute; CHU Sainte-Justine Research Center, Department of Psychiatry, University of Montreal); Irina Rish (Mila Québec AI Institute, University of Montreal) |
| Pseudocode | No | The paper describes adaptations of OOD generalization algorithms and uses mathematical formulations like Equation (1) and (2) in Section 5 and Appendix D. It also refers to algorithms by name (e.g., IRM, VREx, Group DRO) but does not present any of its own methodology or the adapted algorithms in structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to third-party implementations of models or toolboxes that were used (e.g., 'https://github.com/TNTLFreiburg/braindecode' for the Brain Decode Toolbox, 'https://github.com/declare-lab/conv-emotion/tree/master/DialogueRNN' for the DialogueRNN model). It also mentions a 'WOODS repository' for datasets (e.g., 'We provide the Basic-Fourier dataset in the WOODS repository.'). However, it does not contain an explicit statement by the authors about releasing the source code for their own WOODS framework or the adaptations of the OOD generalization algorithms they describe. |
| Open Datasets | Yes | We propose WOODS: a benchmark of 3 synthetic challenges and 8 real-world datasets... CAP (Terzano et al., 2001; Goldberger et al., 2000) dataset... SEDFx (Kemp et al., 2000; Goldberger et al., 2000) dataset... PCL (Lee et al., 2019; Cho et al., 2017; Schalk et al., 2004; Jayaram & Barachant, 2018) dataset... LSA64 (Ronchetti et al., 2016) dataset... LSA64 is openly distributed by the author at http://facundoq.github.io/datasets/lsa64/... HHAR (Stisen et al., 2015; Dua & Graff, 2017) dataset... Ped Count (City of Melbourne, 2017; Godahewa et al., 2021) dataset... Aus Elec (Hyndman & Athanasopoulos, 2018; Godahewa et al., 2021) dataset... IEMOCAP (Bulut et al., 2008) dataset... license available at https://sail.usc.edu/iemocap/iemocap_release.htm. |
| Dataset Splits | Yes | For domain generalization, we split all the training domains into training and validation sets. For subpopulation shift, we split all domains into training, validation, and test sets. The ratios for the rare emotion-shift domain are 1/6, 1/6, and 2/3 for training, validation, and test, respectively. For the remaining domains, dialogues are randomly chosen to achieve ratios of 0.7, 0.1, and 0.2 for training, validation, and test, respectively. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running experiments, such as GPU models, CPU models, or cloud computing instance types with their specifications. |
| Software Dependencies | No | The paper mentions several software tools and libraries used (e.g., PyTorchVideo, the Brain Decode Toolbox, openSMILE), but it does not provide specific version numbers for any of them, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Our framework follows the Domain Bed (Gulrajani & Lopez-Paz, 2020) workflow for hyperparameter search and model selection for a fair and systematic evaluation of OOD generalization algorithms. We perform a random search over 20 hyperparameter configurations, which we repeat three times for error estimation. We then report the performance of the model chosen with our model selection methods (see Section 6.1)... All hyperparameter searches in this work use random searches (Bergstra & Bengio, 2012) over the hyperparameter distribution spaces defined in Table 51 and Table 52... Table 51: Distributions of training hyperparameters for random search (learning rate, batch size, class balance). Table 52: Distributions of algorithm hyperparameters for random search (penalty weight, annealing iterations, η, λ, temperature, δ, adv lr, adv steps, α). |
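The experiment-setup protocol quoted above (DomainBed-style random search over 20 hyperparameter configurations, each repeated three times for error estimation, followed by model selection) can be sketched as follows. This is an illustrative sketch, not the authors' code: the search-space distributions below are hypothetical stand-ins for the paper's Tables 51 and 52, and `toy_train_and_eval` is a placeholder for an actual training run.

```python
import random

# Hypothetical stand-in for the hyperparameter distributions of
# Tables 51-52 (learning rate, batch size, OOD penalty weight).
SEARCH_SPACE = {
    "lr": lambda rng: 10 ** rng.uniform(-5, -2),             # log-uniform
    "batch_size": lambda rng: 2 ** rng.randint(4, 7),        # 16..128
    "penalty_weight": lambda rng: 10 ** rng.uniform(-1, 4),  # e.g. IRM/VREx
}

def sample_config(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}

def random_search(train_and_eval, n_configs=20, n_seeds=3, seed=0):
    """Random search per the quoted protocol: 20 sampled configurations,
    each trained with 3 seeds; returns (best_config, mean_val_score)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_configs):
        config = sample_config(rng)
        # Repeat each configuration with several seeds for error estimation.
        scores = [train_and_eval(config, s) for s in range(n_seeds)]
        results.append((config, sum(scores) / len(scores)))
    # Model selection: keep the config with the best mean validation score.
    return max(results, key=lambda r: r[1])

def toy_train_and_eval(config, seed):
    """Placeholder objective: validation score peaks near lr = 1e-3."""
    import math
    return -abs(math.log10(config["lr"]) + 3)

best_config, best_score = random_search(toy_train_and_eval)
print(best_config, best_score)
```

In the paper's actual workflow, `train_and_eval` would train an ERM or OOD-generalization model and return a validation metric chosen by one of the model-selection methods of Section 6.1; the sketch only captures the search-and-select loop.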