Measuring Orthogonality in Representations of Generative Models
Authors: Robin C. Geyer, Alessandro Torcinovich, João B. S. Carvalho, Alexander Meyer, Joachim M. Buhmann
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Throughout extensive experiments on common downstream tasks, over several benchmark datasets and models, IWO and IWR consistently show stronger correlations with downstream task performance than traditional disentanglement metrics. |
| Researcher Affiliation | Academia | Robin C. Geyer (EMAIL), Department of Computer Science, ETH Zurich; Alessandro Torcinovich (EMAIL), Faculty of Engineering, Free University of Bozen-Bolzano, and Department of Computer Science, ETH Zurich; João B. Carvalho (EMAIL), Department of Computer Science, ETH Zurich; Alexander Meyer (EMAIL), German Heart Center of the Charité; Joachim M. Buhmann (EMAIL), Department of Computer Science, ETH Zurich |
| Pseudocode | Yes | This process is depicted in Figure 2, and a pseudocode implementation can be found in Appendix E. |
| Open Source Code | Yes | More details are listed in the appendix and in our open-source code implementation1. 1https://github.com/cyrusgeyer/iwo |
| Open Datasets | Yes | We consider six benchmark datasets, namely dSprites, Color dSprites and Scream dSprites (Matthey et al., 2017), Cars3D (Reed et al., 2015), smallNORB (LeCun et al., 2004) and Shapes3D (Burgess & Kim, 2018). |
| Dataset Splits | Yes | The dataset is split into a training set with five instances of each category and a test set with the remaining five instances. Data is split into a training (80%) and a test set (20%). During training, part of the training set is used for validation, which is in turn used as an early stopping criterion. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. While Section I mentions "No GPU was used" for GCA, this is specific to the GCA pipeline and not the broader experimental setup for training VAE models, which is a major part of the work. |
| Software Dependencies | No | The paper mentions using "the PyTorch Lightning framework" and "the Adam optimization scheme" but does not specify version numbers for PyTorch Lightning or any other key software libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | The hyperparameters considered for each model are the following: β-VAE (Higgins et al., 2017): β ∈ {1, 2, 4, 6, 8, 16}; Annealed VAE (Burgess et al., 2018): c_max ∈ {5, 10, 25, 50, 75, 100}; β-TCVAE (Chen et al., 2018): β ∈ {1, 2, 4, 6, 8, 10}; Factor-VAE (Kim & Mnih, 2018): γ ∈ {10, 20, 30, 40, 50, 100}; DIP-VAE-I (Kumar et al., 2018): λ_od ∈ {1, 2, 5, 10, 20, 50}; DIP-VAE-II (Kumar et al., 2018): λ_od ∈ {1, 2, 5, 10, 20, 50}. We further use the Adam optimization scheme as proposed by Kingma & Ba (2015) with a learning rate of 5 × 10⁻⁴ and a batch size of 128 for all optimizations. |
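The experiment setup above amounts to a 36-run sweep (six VAE variants, six hyperparameter values each) with shared optimizer settings. The following is a minimal sketch of that grid, assuming nothing about the authors' actual code: all names (`GRIDS`, `sweep_configs`) are illustrative, and only the grids, learning rate, and batch size are taken from the paper.

```python
from itertools import chain

# Per-model hyperparameter grids as quoted in the table above.
# Key: model name; value: (swept hyperparameter, candidate values).
GRIDS = {
    "beta_vae":     ("beta",      [1, 2, 4, 6, 8, 16]),
    "annealed_vae": ("c_max",     [5, 10, 25, 50, 75, 100]),
    "beta_tcvae":   ("beta",      [1, 2, 4, 6, 8, 10]),
    "factor_vae":   ("gamma",     [10, 20, 30, 40, 50, 100]),
    "dip_vae_i":    ("lambda_od", [1, 2, 5, 10, 20, 50]),
    "dip_vae_ii":   ("lambda_od", [1, 2, 5, 10, 20, 50]),
}

# Optimization settings shared by all runs (Adam, per the paper).
LEARNING_RATE = 5e-4
BATCH_SIZE = 128

def sweep_configs():
    """Yield one run configuration per (model, hyperparameter value) pair."""
    for model, (param, values) in GRIDS.items():
        for v in values:
            yield {
                "model": model,
                param: v,
                "lr": LEARNING_RATE,
                "batch_size": BATCH_SIZE,
            }

configs = list(sweep_configs())
print(len(configs))  # 6 models x 6 values each = 36 runs
```

Each emitted dict would then parameterize one training run; the paper reports using Adam (Kingma & Ba, 2015) with these settings for all of them.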