Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen
Authors: Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CFGen across multiple biological datasets, demonstrating its advantages in generative performance and downstream applications. Our main contributions are as follows: We introduce CFGen... We show that our model's full-genome generative performance consistently outperforms existing single-cell generative models qualitatively and quantitatively on multiple biological datasets. We showcase the application of CFGen in enhancing downstream tasks, including robust data augmentation for improved classification of rare cell types and batch correction. |
| Researcher Affiliation | Academia | 1: Helmholtz Munich; 2: Technical University of Munich; 3: Université de Montréal; 4: Mila; 5: MPI for Intelligent Systems, Tübingen. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1: Train CFGen with multiple attributes on scRNA-seq. Algorithm 2: Sampling from multi-attribute guided CFGen for scRNA-seq. |
| Open Source Code | Yes | We make the code for CFGen as well as the links to pre-processed datasets available at https://github.com/theislab/CFGen. |
| Open Datasets | Yes | We evaluate the performance of CFGen conditionally and unconditionally against three baselines... (i) PBMC3K (2,638 cells from a healthy donor, clustering into 8 cell types), (ii) Dentate gyrus (La Manno et al., 2018) (18,213 cells from a developing mouse hippocampus), (iii) Tabula Muris (Tabula Muris Consortium et al., 2018) (245,389 cells from Mus musculus across multiple tissues), and (iv) Human Lung Cell Atlas (HLCA) (Sikkema et al., 2023) (584,944 cells from 486 individuals across 49 datasets). We use the multiome PBMC10K dataset, made available by 10x Genomics. We leverage two large datasets: (i) PBMC COVID (Yoshida et al., 2022), 422,220 blood cells from 93 patients ranging from paediatric to adult; (ii) the HLCA dataset described in Section 5.1. All datasets are publicly available, and their source publications are referenced in the main text. |
| Dataset Splits | Yes | We split the PBMC COVID and HLCA datasets into a training set and a held-out set, ensuring a more challenging generalization task by leaving out all cells from 20% of the donors in both datasets. This results in 80 training and 27 test donors for HLCA and 60 training and 15 test donors for PBMC COVID. |
| Hardware Specification | Yes | We empirically evaluate how different hyperparameters impact CFGen s runtime. To do so, we generate synthetic data using an untrained CFGen instance initialized with a specific configuration and run our experiments on an NVIDIA A100 GPU. |
| Software Dependencies | Yes | The CFGen model is implemented in PyTorch (Paszke et al., 2017), version 2.1.2. The integration is performed using the dopri5 solver with adjoint sensitivity and a tolerance of 1e-5 from the torchdyn package (Poli et al., 2021) in Python 3 (Van Rossum & Drake, 2009). |
| Experiment Setup | Yes | For all settings, we set the learning rate to 0.001, with all layer pairs interleaved with one-dimensional batch normalization layers. We use the AdamW optimizer and the ELU activation function as the non-linearity. In the standard setting, we train the flow model for 1,000 epochs using the AdamW optimizer, a learning rate of 0.001, and a batch size of 256. |
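The donor-level split described in the table (holding out all cells from 20% of donors so that generalization is tested on entirely unseen individuals) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; `split_by_donor` and the synthetic `cell_donors` list are hypothetical names introduced here.

```python
import random

def split_by_donor(cell_donors, holdout_frac=0.2, seed=0):
    """Hold out every cell from a random holdout_frac of donors.

    cell_donors: per-cell donor identifiers (one entry per cell).
    Returns train indices, test indices, and the held-out donor set.
    """
    donors = sorted(set(cell_donors))
    rng = random.Random(seed)
    rng.shuffle(donors)
    n_test = max(1, round(holdout_frac * len(donors)))
    held_out = set(donors[:n_test])
    train_idx = [i for i, d in enumerate(cell_donors) if d not in held_out]
    test_idx = [i for i, d in enumerate(cell_donors) if d in held_out]
    return train_idx, test_idx, held_out

# Synthetic example: 100 cells from 10 donors -> 2 donors held out.
cell_donors = [f"donor{i % 10}" for i in range(100)]
train_idx, test_idx, held_out = split_by_donor(cell_donors)
```

Splitting by donor rather than by cell is what makes the task "more challenging": cells from the same donor are correlated, so a random per-cell split would leak donor-specific signal into the test set.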
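The training setup quoted above (AdamW at learning rate 0.001, batch size 256, ELU non-linearities, layer pairs interleaved with one-dimensional batch normalization) can be sketched in PyTorch. The layer widths below are hypothetical placeholders, not CFGen's actual architecture.

```python
import torch
import torch.nn as nn

def make_mlp(dims):
    # Interleave each Linear layer with BatchNorm1d and ELU, as described in
    # the quoted setup; the final layer is left linear.
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(dims) - 2:
            layers.append(nn.BatchNorm1d(d_out))
            layers.append(nn.ELU())
    return nn.Sequential(*layers)

# Illustrative widths; 2000 stands in for a gene-expression input dimension.
model = make_mlp([2000, 512, 128])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = torch.randn(256, 2000)  # batch size 256, as in the quoted setting
out = model(batch)
```

Each optimizer step would then follow the usual PyTorch pattern (`optimizer.zero_grad()`, loss backward, `optimizer.step()`) inside the 1,000-epoch loop the paper describes.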