Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen
Authors: Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CFGen across multiple biological datasets, demonstrating its advantages in generative performance and downstream applications. Our main contributions are as follows: We introduce CFGen... We show that our model's full-genome generative performance consistently outperforms existing single-cell generative models qualitatively and quantitatively on multiple biological datasets. We showcase the application of CFGen in enhancing downstream tasks, including robust data augmentation for improved classification of rare cell types and batch correction. |
| Researcher Affiliation | Academia | 1: Helmholtz Munich; 2: Technical University of Munich; 3: Université de Montréal; 4: Mila; 5: MPI for Intelligent Systems, Tübingen. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1: Train CFGen with multiple attributes on scRNA-seq. Algorithm 2: Sampling from multi-attribute guided CFGen for scRNA-seq. |
| Open Source Code | Yes | We make the code for CFGen as well as the links to pre-processed datasets available at https://github.com/theislab/CFGen. |
| Open Datasets | Yes | We evaluate the performance of CFGen conditionally and unconditionally against three baselines... (i) PBMC3K (2,638 cells from a healthy donor, clustering into 8 cell types), (ii) Dentate gyrus (La Manno et al., 2018) (18,213 cells from a developing mouse hippocampus), (iii) Tabula Muris (Tabula Muris Consortium et al., 2018) (245,389 cells from Mus musculus across multiple tissues), and (iv) Human Lung Cell Atlas (HLCA) (Sikkema et al., 2023) (584,944 cells from 486 individuals across 49 datasets). We use the multiome PBMC10K dataset, made available by 10x Genomics. We leverage two large datasets: (i) PBMC COVID (Yoshida et al., 2022), 422,220 blood cells from 93 patients ranging from paediatric to adult; (ii) the HLCA dataset described in Section 5.1. All datasets are publicly available, and their source publications are referenced in the main text. |
| Dataset Splits | Yes | We split the PBMC COVID and HLCA datasets into a training set and a held-out set, ensuring a more challenging generalization task by leaving out all cells from 20% of the donors in both datasets. This results in 80 training and 27 test donors for HLCA and 60 training and 15 test donors for PBMC COVID. |
| Hardware Specification | Yes | We empirically evaluate how different hyperparameters impact CFGen s runtime. To do so, we generate synthetic data using an untrained CFGen instance initialized with a specific configuration and run our experiments on an NVIDIA A100 GPU. |
| Software Dependencies | Yes | The CFGen model is implemented in PyTorch (Paszke et al., 2017), version 2.1.2. The integration is performed using the dopri5 solver with adjoint sensitivity and a tolerance of 1e-5 from the torchdyn package (Poli et al., 2021) in Python 3 (Van Rossum & Drake, 2009). |
| Experiment Setup | Yes | For all settings, we set the learning rate to 0.001, with all layer pairs interleaved with one-dimensional batch normalization layers. We use the AdamW optimizer and the ELU activation function as the non-linearity. In the standard setting, we train the flow model for 1,000 epochs using the AdamW optimizer, a learning rate of 0.001, and a batch size of 256. |
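The donor-level split described in the table (holding out all cells from 20% of donors so that generalization is tested on entirely unseen individuals) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; `split_by_donor` and the synthetic `cell_donors` list are hypothetical names introduced here.

```python
import random

def split_by_donor(cell_donors, holdout_frac=0.2, seed=0):
    """Hold out every cell from a random holdout_frac of donors.

    cell_donors: per-cell donor identifiers (one entry per cell).
    Returns train indices, test indices, and the held-out donor set.
    """
    donors = sorted(set(cell_donors))
    rng = random.Random(seed)
    rng.shuffle(donors)
    n_test = max(1, round(holdout_frac * len(donors)))
    held_out = set(donors[:n_test])
    train_idx = [i for i, d in enumerate(cell_donors) if d not in held_out]
    test_idx = [i for i, d in enumerate(cell_donors) if d in held_out]
    return train_idx, test_idx, held_out

# Synthetic example: 100 cells from 10 donors -> 2 donors held out.
cell_donors = [f"donor{i % 10}" for i in range(100)]
train_idx, test_idx, held_out = split_by_donor(cell_donors)
```

Splitting by donor rather than by cell is what makes the task "more challenging": cells from the same donor are correlated, so a random per-cell split would leak donor-specific signal into the test set.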
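The training setup quoted above (AdamW at learning rate 0.001, batch size 256, ELU non-linearities, layer pairs interleaved with one-dimensional batch normalization) can be sketched in PyTorch. The layer widths below are hypothetical placeholders, not CFGen's actual architecture.

```python
import torch
import torch.nn as nn

def make_mlp(dims):
    # Interleave each Linear layer with BatchNorm1d and ELU, as described in
    # the quoted setup; the final layer is left linear.
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(dims) - 2:
            layers.append(nn.BatchNorm1d(d_out))
            layers.append(nn.ELU())
    return nn.Sequential(*layers)

# Illustrative widths; 2000 stands in for a gene-expression input dimension.
model = make_mlp([2000, 512, 128])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = torch.randn(256, 2000)  # batch size 256, as in the quoted setting
out = model(batch)
```

Each optimizer step would then follow the usual PyTorch pattern (`optimizer.zero_grad()`, loss backward, `optimizer.step()`) inside the 1,000-epoch loop the paper describes.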