Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models

Authors: Dihan Zheng, Bo Huang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the Open Cell dataset. Experimental results show that our model generates more detailed and sharper protein images than previous methods.
Researcher Affiliation | Academia | ¹Department of Pharmaceutical Chemistry, UCSF, San Francisco, CA 94143; ²Department of Biochemistry and Biophysics, UCSF, San Francisco, CA 94143; ³Chan Zuckerberg Biohub San Francisco, San Francisco, CA 94158. Correspondence to: Bo Huang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Training OA-ARDM). Require: network f_θ, datapoint S. Ensure: L_OA-ARDM. 1: Sample t ∼ U(1, …, D). 2: Sample σ ∼ U(S_D). 3: Compute m ← (σ < t). 4: Compute ℓ ← (1 − m) ⊙ log C(S | f_θ(m ⊙ S)). 5: L_OA-ARDM ← 1/(D − t + 1) · sum(ℓ). Algorithm 2 (Sampling from OA-ARDM). Require: network f_θ. Ensure: sample S. 1: Initialize S = 0, sample σ ∼ U(S_D). 2: for t = 1, 2, …, D do 3: m ← (σ < t) and n ← (σ = t). 4: S′ ∼ C(S′ | f_θ(m ⊙ S)). 5: S ← (1 − n) ⊙ S + n ⊙ S′. 6: end for
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | The Human Protein Atlas (HPA) dataset (Digre & Lindskog, 2021) includes immunofluorescence images across various human cell lines, with the proteins of interest stained by antibodies. ... The Open Cell (Cho et al., 2022) dataset provides a library of 1,311 CRISPR-edited HEK293T human cell lines...
Dataset Splits | Yes | Given the size limitations of the HPA and Open Cell datasets, particularly in the diversity of protein sequences, we randomly selected 100 proteins from the shared subset between the two datasets as the test set, leaving the remainder for training. The test sets for HPA and Open Cell contain 714 and 473 data points, respectively.
Hardware Specification | Yes | All models are trained using two Nvidia H200 GPUs.
Software Dependencies | No | The paper mentions software components such as the VAE, the ESM2 model, a U-Net, Stable Diffusion, and the Adam optimizer, but does not specify their version numbers or other crucial library versions needed for reproduction.
Experiment Setup | Yes | For the HPA dataset, images are randomly cropped to a size of 1024 and then resized to 256, while images from the Open Cell dataset are directly cropped to 256 pixels. Data augmentation is applied using random flips and rotations. The latent representation has dimensions 4 × 64 × 64. The KL loss coefficient is set to 1 × 10⁻⁵. The learning rate is initialized using a linear warm-up strategy, increasing from 0 to 3 × 10⁻⁴ over the first 1,000 iterations, followed by a linear decay to zero. The batch size is set to 192. The VAE is trained for a total of 50,000 steps on the HPA dataset and fine-tuned for 20,000 steps on the Open Cell dataset. Next, the VAE is fixed and the latent diffusion model is trained. CELL-Diff is pre-trained on the HPA dataset and fine-tuned on the Open Cell dataset. Both pre-training and fine-tuning are conducted for 50,000 iterations using the Adam optimizer (Kingma & Ba, 2014). The learning rate is initialized using a linear warm-up strategy, increasing from 0 to 1 × 10⁻⁴ over the first 1,000 iterations, followed by a linear decay to zero. The batch size is set to 64. The sequence embedding dimension is 1,280, and the bidirectional transformer module consists of 8 layers with 8-head attention. CELL-Diff is trained with 200 diffusion steps using a cosine noise schedule (Peebles & Xie, 2023), and uses DDIM (Song et al., 2020) with 100 steps to accelerate sampling. The weighting coefficient λ in (12) is set to 1, and the maximum protein sequence length is 2,048.
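The OA-ARDM pseudocode quoted in the Pseudocode row can be written as a small runnable sketch. The names `oa_ardm_loss`, `sample_oa_ardm`, and `f_theta` are illustrative, not from the paper; `f_theta` stands in for the trained network and is assumed to return per-position log-probabilities.

```python
import numpy as np

def oa_ardm_loss(f_theta, S, rng):
    """Algorithm 1 (sketch): masked-position log-likelihood.

    f_theta(x) returns per-position log-probabilities of shape (D, K);
    S is an integer sequence of length D. Position value 0 serves as
    the absorbing (masked) state, matching the pseudocode's m * S.
    """
    D = len(S)
    t = rng.integers(1, D + 1)          # t ~ U(1, ..., D)
    sigma = rng.permutation(D) + 1      # sigma ~ U(S_D), ranks 1..D
    m = (sigma < t).astype(int)         # already-revealed positions
    log_probs = f_theta(m * S)          # predict from the masked input
    l = (1 - m) * log_probs[np.arange(D), S]  # score masked positions only
    return l.sum() / (D - t + 1)

def sample_oa_ardm(f_theta, D, K, rng):
    """Algorithm 2 (sketch): reveal one position per step in random order."""
    S = np.zeros(D, dtype=int)
    sigma = rng.permutation(D) + 1
    for t in range(1, D + 1):
        m = (sigma < t).astype(int)     # revealed so far
        n = (sigma == t).astype(int)    # the one position revealed this step
        probs = np.exp(f_theta(m * S))  # categorical C(. | f_theta(m * S))
        S_new = np.array([rng.choice(K, p=p) for p in probs])
        S = (1 - n) * S + n * S_new     # keep old values, fill in one new
    return S
```

A toy `f_theta` returning uniform log-probabilities over K classes is enough to exercise both functions end to end.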
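The dataset split described above (100 proteins held out from the HPA/Open Cell intersection, remainder used for training) amounts to a simple set operation over protein identifiers. The function and argument names below are illustrative, not from the paper.

```python
import random

def make_split(hpa_proteins, opencell_proteins, n_test=100, seed=0):
    """Sketch of the described split: hold out n_test proteins from the
    shared subset as the test set; all remaining proteins train."""
    shared = sorted(set(hpa_proteins) & set(opencell_proteins))
    rng = random.Random(seed)
    test = set(rng.sample(shared, n_test))
    train = (set(hpa_proteins) | set(opencell_proteins)) - test
    return train, test
```

Because the held-out proteins come from the intersection, each test protein has images in both datasets, which is what yields separate HPA and Open Cell test counts (714 and 473 data points).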
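The learning-rate schedule described for both training phases (linear warm-up from 0 over the first 1,000 iterations, then linear decay to zero) can be written as a small helper; the function name and signature are illustrative.

```python
def lr_at(step, total_steps, peak_lr, warmup=1000):
    """Learning rate at a given iteration: linear warm-up from 0 to
    peak_lr over `warmup` steps, then linear decay to 0 at `total_steps`."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

For example, the diffusion pre-training phase described above would correspond to `lr_at(step, 50_000, 1e-4)`, and the VAE phase to `lr_at(step, 50_000, 3e-4)`.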