Do causal predictors generalize better to new domains?
Authors: Vivian Nastl, Moritz Hardt
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets...allowing us to test how well a model trained in one domain performs in another. |
| Researcher Affiliation | Academia | Vivian Y. Nastl, Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center; Max Planck ETH Center for Learning Systems. Moritz Hardt, Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center. |
| Pseudocode | No | The paper describes experimental procedures and methods in paragraph text and figures, but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is based on Gardner et al. [2023], Hardt and Kim [2023] and Gulrajani and Lopez-Paz [2020]. It is available at https://github.com/socialfoundations/causal-features. |
| Open Datasets | Yes | We consider 16 prediction tasks on tabular datasets from prior work [Ding et al., 2021, Hardt and Kim, 2023, Gardner et al., 2023]...Table 1: Description of tasks, data sources and number of features in each selection. |
| Dataset Splits | Yes | We have a train/test/validation split within the in-domain set, and a test/validation split within the out-of-domain set. |
| Hardware Specification | Yes | Each job was given the same computing resources: 1 CPU. Compute nodes use AMD EPYC 7662 64-core CPUs. Memory was allocated as required for each task: all jobs were allocated at least 128GB of RAM; for the task Public Coverage, jobs were allocated 384GB of RAM. |
| Software Dependencies | No | The paper mentions several software components and libraries, such as 'Hyperopt [Bergstra et al., 2013]' and machine learning algorithms (XGBoost, LightGBM, IRM, REx, etc.), but it does not specify their version numbers. |
| Experiment Setup | Yes | We conduct a hyperparameter sweep using Hyperopt [Bergstra et al., 2013] on the in-domain validation data. A method is tuned for 50 trials. We exclusively train on the training set. |
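The split scheme quoted above (train/test/validation within the in-domain set, test/validation within the out-of-domain set) can be sketched with standard-library Python. The 80/10/10 and 50/50 proportions are assumptions for illustration; the quoted text does not state the fractions.

```python
import random

def make_splits(in_domain, out_of_domain, seed=0):
    """Sketch of the paper's split scheme: in-domain records are divided
    into train/test/validation, out-of-domain records into test/validation.
    The split fractions (80/10/10 and 50/50) are assumed, not from the paper."""
    rng = random.Random(seed)
    ind, ood = list(in_domain), list(out_of_domain)
    rng.shuffle(ind)
    rng.shuffle(ood)
    n, m = len(ind), len(ood)
    return {
        "id_train": ind[: int(0.8 * n)],          # used for model fitting only
        "id_test": ind[int(0.8 * n): int(0.9 * n)],
        "id_val": ind[int(0.9 * n):],             # used for hyperparameter selection
        "ood_test": ood[: m // 2],                # measures cross-domain generalization
        "ood_val": ood[m // 2:],
    }

splits = make_splits(range(100), range(40))
print({k: len(v) for k, v in splits.items()})
# → {'id_train': 80, 'id_test': 10, 'id_val': 10, 'ood_test': 20, 'ood_val': 20}
```

Keeping the out-of-domain validation set separate from the in-domain one matters here: selecting hyperparameters on in-domain validation data, as the paper does, avoids leaking out-of-domain information into model selection.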
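The experiment-setup row describes a 50-trial hyperparameter sweep selected on in-domain validation data. The paper uses Hyperopt's TPE search; the standard-library random search below is a hedged stand-in that keeps the same structure (fixed trial budget, train-only fitting, validation-based selection). The search space and scoring callback are illustrative assumptions.

```python
import random

def tune(train, val_score, n_trials=50, seed=0):
    """Random-search stand-in for the paper's 50-trial Hyperopt sweep.
    `val_score(train, params)` is assumed to fit a model on the training
    set with `params` and return its in-domain validation score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform draw (assumed range)
            "max_depth": rng.randint(2, 10),
        }
        score = val_score(train, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for fit-then-validate; peaks at learning_rate ≈ 0.1.
best_params, best_score = tune(None, lambda train, p: -(p["learning_rate"] - 0.1) ** 2)
print(best_params)
```

Hyperopt's TPE algorithm adapts its proposals to past trial results, so at the same 50-trial budget it typically finds better configurations than this uniform random search; the control flow is otherwise the same.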