Hierarchically branched diffusion models leverage dataset structure for class-conditional generation
Authors: Alex M Tseng, Max W Shen, Tommaso Biancalani, Gabriele Scalia
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate branched diffusion models on several benchmark and large real-world scientific datasets, spanning different data modalities (images, tabular data, and graphs). We particularly highlight the advantages of branched diffusion models on a single-cell RNA-seq dataset, where our branched model leverages the intrinsic hierarchical structure between human cell types. |
| Researcher Affiliation | Industry | Alex M. Tseng (EMAIL), Max W. Shen (EMAIL), Tommaso Biancalani (EMAIL), Gabriele Scalia (EMAIL) — Biology Research \| AI Development, Genentech |
| Pseudocode | Yes | Algorithm 1 Training a branched diffusion model... Algorithm 2 Sampling a branched diffusion model |
| Open Source Code | No | The paper does not provide a specific link to source code developed for this methodology, nor does it explicitly state that the code will be made publicly available. It mentions using external tools such as Scanpy, CellTypist, scVI, and RDKit, but not the authors' own implementation of the branched diffusion models. |
| Open Datasets | Yes | We demonstrate branched diffusion models on several datasets of different data modalities: 1) MNIST handwritten-digit images (Le Cun et al.); 2) a tabular dataset of several features for the 26 English letters in various fonts (Frey & Slate, 1991); 3) a real-world, large scientific dataset of single-cell RNA-seq, measuring the gene expression levels of many blood cell types in COVID-19 patients, influenza patients, and healthy donors (Lee et al., 2020); and 4) ZINC250K, a large dataset of 250K real drug-like molecules (Irwin et al., 2012). ... We downloaded the MNIST dataset and used all digits from http://yann.lecun.com/exdb/mnist/ (Le Cun et al.). ... We downloaded the tabular letter-recognition dataset from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition (Frey & Slate, 1991). ... We downloaded the single-cell RNA-seq dataset from GEO (GSE149689) (Lee et al., 2020). |
| Dataset Splits | No | The paper describes how samples were used for evaluation (e.g., 'generated 1000 samples of each class from the branched model... and randomly selected 1000 samples of each class from the true dataset') and mentions 'Sample (x0, c) from training data {(x(k), c(k))}' in Algorithm 1, but it does not specify explicit, reproducible training/validation/test splits (e.g., exact percentages or counts used for model training) in the main text or supplementary methods for any of the datasets. |
| Hardware Specification | Yes | We trained all of our models and performed all analyses on a single Nvidia Quadro P6000. |
| Software Dependencies | No | We used Scanpy to pre-process the data, using a standard workflow... We assigned cell-type labels using CellTypist... To train our diffusion models, we projected the gene expressions down to a latent space of 200 dimensions, using the linearly decoded variational autoencoder in scVI (Gayoso et al., 2022). ... We downloaded the ZINC250K dataset and converted the SMILES strings into molecular graphs using RDKit. ... We used the Fashion MNIST dataset as loaded from TorchVision. The paper names these software packages but does not provide version numbers for any of them. |
| Experiment Setup | Yes | For all of our models, we trained with a batch size of 128 examples, drawing uniformly from the entire dataset. ... For all of our models, we used a learning rate of 0.001, and trained our models until the loss had converged. For our label-guided MNIST model, we trained for 30 epochs. ... For our branched continuous-time MNIST model, we trained for 90 epochs. ... The autoencoder was trained for 500 epochs, with a learning rate of 0.005. |
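The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is illustrative only: the field names and the `TrainConfig` class are not from the paper; the numeric values are the ones reported above.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hyperparameters as reported in the paper's experiment setup.

    Epoch counts varied by model: 30 for the label-guided MNIST model,
    90 for the branched continuous-time MNIST model, and 500 for the
    autoencoder (which also used a learning rate of 0.005).
    """
    batch_size: int = 128          # uniform draws from the entire dataset
    learning_rate: float = 0.001   # all diffusion models
    epochs: int = 30               # label-guided MNIST model


# Per-model variants derived from the reported numbers.
label_guided_mnist = TrainConfig(epochs=30)
branched_continuous_mnist = TrainConfig(epochs=90)
autoencoder = TrainConfig(learning_rate=0.005, epochs=500)
```

Training was reported to continue "until the loss had converged", so the epoch counts above are the values the authors state, not a general stopping rule.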