Hierarchically branched diffusion models leverage dataset structure for class-conditional generation
Authors: Alex M Tseng, Max W Shen, Tommaso Biancalani, Gabriele Scalia
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate branched diffusion models on several benchmark and large real-world scientific datasets, spanning different data modalities (images, tabular data, and graphs). We particularly highlight the advantages of branched diffusion models on a single-cell RNA-seq dataset, where our branched model leverages the intrinsic hierarchical structure between human cell types. |
| Researcher Affiliation | Industry | Alex M. Tseng (EMAIL), Max W. Shen (EMAIL), Tommaso Biancalani (EMAIL), Gabriele Scalia (EMAIL) — Biology Research \| AI Development, Genentech |
| Pseudocode | Yes | Algorithm 1 Training a branched diffusion model... Algorithm 2 Sampling a branched diffusion model |
| Open Source Code | No | The paper does not provide a specific link to source code developed for this methodology, nor does it explicitly state that the code will be made publicly available. It mentions using external tools such as Scanpy, CellTypist, scVI, and RDKit, but not the authors' own implementation of the branched diffusion models. |
| Open Datasets | Yes | We demonstrate branched diffusion models on several datasets of different data modalities: 1) MNIST handwritten-digit images (Le Cun et al.); 2) a tabular dataset of several features for the 26 English letters in various fonts (Frey & Slate, 1991); 3) a real-world, large scientific dataset of single-cell RNA-seq, measuring the gene expression levels of many blood cell types in COVID-19 patients, influenza patients, and healthy donors (Lee et al., 2020); and 4) ZINC250K, a large dataset of 250K real drug-like molecules (Irwin et al., 2012). ... We downloaded the MNIST dataset and used all digits from http://yann.lecun.com/exdb/mnist/ (Le Cun et al.). ... We downloaded the tabular letter-recognition dataset from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition (Frey & Slate, 1991). ... We downloaded the single-cell RNA-seq dataset from GEO (GSE149689) (Lee et al., 2020). |
| Dataset Splits | No | The paper describes how samples were used for evaluation (e.g., 'generated 1000 samples of each class from the branched model... and randomly selected 1000 samples of each class from the true dataset') and mentions 'Sample (x0, c) from training data {(x(k), c(k))}' in Algorithm 1, but it does not specify explicit, reproducible training/validation/test splits (e.g., exact percentages or counts used for model training) in the main text or supplementary methods for any of the datasets. |
| Hardware Specification | Yes | We trained all of our models and performed all analyses on a single Nvidia Quadro P6000. |
| Software Dependencies | No | We used Scanpy to pre-process the data, using a standard workflow... We assigned cell-type labels using CellTypist... To train our diffusion models, we projected the gene expressions down to a latent space of 200 dimensions, using the linearly decoded variational autoencoder in scVI (Gayoso et al., 2022). ... We downloaded the ZINC250K dataset and converted the SMILES strings into molecular graphs using RDKit. ... We used the Fashion MNIST dataset as loaded from TorchVision. The paper names these software packages but does not provide version numbers for any of them. |
| Experiment Setup | Yes | For all of our models, we trained with a batch size of 128 examples, drawing uniformly from the entire dataset. ... For all of our models, we used a learning rate of 0.001, and trained our models until the loss had converged. For our label-guided MNIST model, we trained for 30 epochs. ... For our branched continuous-time MNIST model, we trained for 90 epochs. ... The autoencoder was trained for 500 epochs, with a learning rate of 0.005. |
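The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is illustrative only: the field names and the `TrainConfig` class are not from the paper; the numeric values are the ones reported above.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hyperparameters as reported in the paper's experiment setup.

    Epoch counts varied by model: 30 for the label-guided MNIST model,
    90 for the branched continuous-time MNIST model, and 500 for the
    autoencoder (which also used a learning rate of 0.005).
    """
    batch_size: int = 128          # uniform draws from the entire dataset
    learning_rate: float = 0.001   # all diffusion models
    epochs: int = 30               # label-guided MNIST model


# Per-model variants derived from the reported numbers.
label_guided_mnist = TrainConfig(epochs=30)
branched_continuous_mnist = TrainConfig(epochs=90)
autoencoder = TrainConfig(learning_rate=0.005, epochs=500)
```

Training was reported to continue "until the loss had converged", so the epoch counts above are the values the authors state, not a general stopping rule.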