ViTally Consistent: Scaling Biological Representation Learning for Cell Microscopy

Authors: Kian Kenyon-Dean, Zitong Jerry Wang, John Urbanik, Konstantin Donhauser, Jason Hartford, Saber Saberian, Nil Sahin, Ihab Bendidi, Safiye Celik, Juan Sebastián Rodríguez Vera, Marta Fay, Imran S Haque, Oren Kraus

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using this strategy, we present the largest foundation model for cell microscopy data to our knowledge, a new 1.9 billion-parameter ViT-G/8 MAE trained on over 8 billion microscopy image crops. Compared to a previously published ViT-L/8 MAE, our new model achieves a 60% improvement in linear separability of genetic perturbations and obtains the best overall performance on whole-genome relationship recall, batch correction replicate consistency, and compound-gene activity prediction benchmarks."
Researcher Affiliation | Collaboration | "Kian Kenyon-Dean¹, Zitong Jerry Wang¹, John Urbanik¹, Konstantin Donhauser², Jason Hartford²,³, Saber Saberian¹, Nil Sahin¹, Ihab Bendidi², Safiye Celik¹, Juan Sebastián Rodríguez Vera¹, Marta Fay¹, Imran S Haque¹, Oren Kraus¹. ¹Recursion, ²Valence Labs, ³University of Manchester. Correspondence to: Kian Kenyon-Dean <EMAIL>, Oren Kraus <EMAIL>."
Pseudocode | No | The paper describes methods and procedures narratively and through mathematical equations but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We publicly release the inference, reconstruction visualization and benchmarking code, along with the full weights for CA-MAE ViT-S/16." Code: https://github.com/recursionpharma/maes_microscopy; weights: https://huggingface.co/recursionpharma/OpenPhenom
Open Datasets | Yes | "We train and evaluate various vision transformers (ViTs, Table 4) as encoders to extract feature embeddings from 256×256×6 (H×W×C) microscopy image crops (Figure 2). Our block-wise search consists of training a logistic regression model (linear probe) on the output features of each transformer block to predict either the gene that was perturbed or the functional group that the gene belongs to, and test performance on held-out experiments (A.4). ... RxRx1: 1,139-class siRNA genetic perturbation classification. We expect high-quality representations of cell images to generate similar embeddings for cells with the same perturbation, hence a simple linear probe should be able to predict gene perturbation from these representations reasonably well. We train linear probes on the publicly available RxRx1 dataset (Sypetkowski et al., 2023), which consists of 125,510 high-resolution fluorescence microscopy images of human cells under 1,138 siRNA-induced gene knockdowns (plus unperturbed controls) across four cell types (HepG2, HUVEC, U2OS, RPE). In order to validate that the MAEs generalize to entirely novel data, we evaluated a subset of models on completely external public data generated by different assays and from a variety of different labs as produced by the JUMP-CP consortium (Chandrasekaran et al., 2023). RxRx3-core (Kraus et al., 2025) is a publicly available benchmarking dataset for assessing biological capabilities of computer vision models. RxRx3-core includes labeled images (compressed to JPEG-2000) of 735 genetic knockouts and 1,674 small-molecule perturbations across eight concentrations, drawn from 222,601 wells (512×512×6-pixel center-crops) from the larger RxRx3 dataset." Dataset: https://huggingface.co/datasets/recursionpharma/rxrx3-core
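The block-wise linear-probe evaluation quoted above is straightforward to reproduce in spirit. Below is a minimal, NumPy-only sketch: synthetic class-clustered vectors stand in for per-block ViT features, and a multinomial logistic regression ("linear probe") is fit by gradient descent. All shapes, class counts, and noise levels are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 600 crops, 128-dim embeddings, 3 perturbation classes.
# Real probes use per-block ViT features with gene / functional-group labels.
n, d, k = 600, 128, 3
centers = rng.normal(size=(k, d))
y = rng.integers(0, k, size=n)
X = centers[y] + rng.normal(scale=2.0, size=(n, d))

# Hold out the last 100 samples (the paper holds out whole experiments).
X_tr, y_tr, X_te, y_te = X[:500], y[:500], X[500:], y[500:]

# Multinomial logistic regression probe, trained by plain gradient descent.
W = np.zeros((d, k))
b = np.zeros(k)
for _ in range(300):
    logits = X_tr @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y_tr)), y_tr] -= 1.0   # softmax cross-entropy gradient
    W -= 0.1 * (X_tr.T @ p) / len(y_tr)
    b -= 0.1 * p.mean(axis=0)

acc = ((X_te @ W + b).argmax(axis=1) == y_te).mean()
print(f"held-out probe accuracy: {acc:.2f}")
```

In practice the paper uses scikit-learn for this step; the probe is deliberately simple so that test accuracy reflects the linear separability of the frozen embeddings rather than the capacity of the classifier.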
Dataset Splits | No | The paper mentions strategies for splitting data, such as "The data was split by experiments, ensuring that the test data originated from experiments distinct from those used for training" (A.4), and refers to "held-out experiments" and "a small subset of 80,000 wells from RxRx3 ... to evaluate linear probes" (Section 4). However, it does not provide specific percentages, absolute sample counts, or explicit references to predefined splits for training, validation, and testing that would be needed to reproduce the data partitioning precisely.
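The experiment-held-out strategy the paper describes amounts to a group-wise split: no experiment may contribute wells to both partitions. A minimal sketch, with made-up experiment IDs and a test fraction chosen for illustration (the paper does not state one):

```python
import random

def split_by_experiment(wells, test_frac=0.2, seed=0):
    """Partition wells so that train and test share no experiment IDs."""
    experiments = sorted({w["experiment"] for w in wells})
    rng = random.Random(seed)
    rng.shuffle(experiments)
    n_test = max(1, int(len(experiments) * test_frac))
    test_exps = set(experiments[:n_test])
    train = [w for w in wells if w["experiment"] not in test_exps]
    test = [w for w in wells if w["experiment"] in test_exps]
    return train, test

# Toy example: 100 wells spread over 10 experiments.
wells = [{"experiment": f"exp{i % 10}", "well": i} for i in range(100)]
train, test = split_by_experiment(wells)
```

Splitting on experiment IDs rather than on individual wells prevents batch effects from leaking between partitions, which is the stated motivation for the paper's protocol.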
Hardware Specification | Yes | "By combining these insights, we are able to train a new foundation model, MAE-G/8, a 1.9 billion parameter ViT-G/8 MAE trained on Phenoprints-16M for 48,000 H100 GPU hours on more than 8 billion samples drawn from the curated dataset (Figure 1A, 3.2), resulting in significant improvements across a range of challenging biological benchmarks. Training this model required 256 H100 GPUs running in parallel for over 1 week. Table 5 (training hyperparameters for the new models presented in this work): # GPUs: 16 A100s / 128 H100s / 256 H100s."
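The two quoted figures are internally consistent: 48,000 GPU-hours spread across 256 GPUs works out to just under eight wall-clock days, matching "over 1 week". A quick arithmetic check:

```python
# Sanity check on the reported compute budget.
gpu_hours = 48_000
n_gpus = 256

hours_per_gpu = gpu_hours / n_gpus          # 187.5 hours
wall_clock_days = hours_per_gpu / 24        # ~7.8 days
print(f"{wall_clock_days:.1f} wall-clock days")
```
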
Software Dependencies | No | The paper mentions specific software components such as the "Lion optimizer from (Chen et al., 2023)" (Table 5) and the "scikit-learn library" (A.4, training linear probes). However, it does not provide version numbers for these libraries or for other critical software dependencies (e.g., Python, PyTorch, CUDA) needed for exact replication.
Experiment Setup | Yes | "Our primary point of comparison is with respect to the best pretrained foundation model presented by Kraus et al. (2024), the MAE-ViT-L/8+ trained on RPI-93M. This MAE-L/8 was trained for approximately 40 epochs, learning from over 3.5 billion image crops, using the L2 mean squared error loss function plus an additional Fourier-domain reconstruction loss term. MAE-L/8 trained on Phenoprints-16M: holding the model backbone constant compared to the MAE-ViT-L/8 by Kraus et al. (2024), we assess the impact of our curated dataset in contrast to the 93M dataset by training a new ViT-L/8 MAE for 500 epochs on Phenoprints-16M. MAE-G/8 trained on Phenoprints-16M: ... training a new ViT-Gigantic MAE with nearly 1.9 billion parameters for 500 epochs on Phenoprints-16M. ... See A.2 for other hyperparameter settings we used for model training. Table 5 (training hyperparameters for the new models presented in this work): each model used a one-cycle cosine learning-rate decay schedule with 10% warm-up and the Lion optimizer (Chen et al., 2023) with betas (0.9, 0.95) and weight decay 0.05, with additional ViT settings such as LayerScale as proposed by Dehghani et al. (2023). Training epochs: 100 / 500 / 500; learning rate: 1e-4 / 3e-5 / 3e-5; global batch size: 2048 / 16384 / 8192; stochastic depth: 0.1 / 0.3 / 0.6. A.2 (training hyperparameters): each model was trained using a 75% mask ratio and the standard decoder architecture for MAEs (He et al., 2022), with the standard L2 MAE loss and the Fourier-space loss function implemented by Kraus et al. (2024) with a weight of α = 0.01."
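The quoted one-cycle cosine schedule with 10% warm-up can be sketched as a per-step learning-rate function. This is a generic reconstruction, not the paper's code: the step granularity, linear warm-up shape, and a floor of zero are all assumptions; the peak value below is the MAE-G/8 rate from Table 5.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.1, min_lr=0.0):
    """One-cycle schedule: linear warm-up over the first warmup_frac of
    training, then cosine decay from peak_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000           # illustrative step count
peak = 3e-5            # MAE-G/8 peak learning rate from Table 5
lrs = [lr_at(s, total, peak) for s in range(total)]
```

Plotting `lrs` shows the characteristic ramp to the peak at 10% of training followed by a smooth cosine decay toward zero by the final step.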