Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models

Authors: Dihan Zheng, Bo Huang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the Open Cell dataset. Experimental results show that our model generates more detailed and sharper protein images than previous methods.
Researcher Affiliation | Academia | ¹Department of Pharmaceutical Chemistry, UCSF, San Francisco, CA 94143; ²Department of Biochemistry and Biophysics, UCSF, San Francisco, CA 94143; ³Chan Zuckerberg Biohub San Francisco, San Francisco, CA 94158. Correspondence to: Bo Huang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Training OA-ARDM). Require: network f_θ, datapoint S. Ensure: L_OA-ARDM. 1: Sample t ∼ U(1, …, D). 2: Sample σ ∼ U(S_D). 3: Compute m ← (σ < t). 4: Compute ℓ ← (1 − m) ⊙ log C(S | f_θ(m ⊙ S)). 5: L_OA-ARDM ← 1/(D − t + 1) · sum(ℓ). Algorithm 2 (Sampling from OA-ARDM). Require: network f_θ. Ensure: sample S. 1: Initialize S = 0, sample σ ∼ U(S_D). 2: for t = 1, 2, …, D do 3: m ← (σ < t) and n ← (σ = t). 4: S′ ∼ C(S′ | f_θ(m ⊙ S)). 5: S ← (1 − n) ⊙ S + n ⊙ S′. 6: end for
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | The Human Protein Atlas (HPA) dataset (Digre & Lindskog, 2021) includes immunofluorescence images across various human cell lines, with the proteins of interest stained by antibodies. ... The Open Cell (Cho et al., 2022) dataset provides a library of 1,311 CRISPR-edited HEK293T human cell lines...
Dataset Splits | Yes | Given the size limitations of the HPA and Open Cell datasets, particularly in the diversity of protein sequences, we randomly selected 100 proteins from the shared subset between the two datasets as the test set, leaving the remainder for training. The test sets for HPA and Open Cell contain 714 and 473 data points, respectively.
Hardware Specification | Yes | All models are trained using two Nvidia H200 GPUs.
Software Dependencies | No | The paper mentions software components such as the VAE, the ESM2 model, a U-Net, Stable Diffusion, and the Adam optimizer, but does not specify their version numbers or other crucial library versions needed for reproduction.
Experiment Setup | Yes | For the HPA dataset, images are randomly cropped to a size of 1024 and then resized to 256, while images from the Open Cell dataset are directly cropped to 256 pixels. Data augmentation is applied using random flips and rotations. The latent representation has dimensions 4 × 64 × 64. The KL loss coefficient is set to 1 × 10⁻⁵. The learning rate is initialized using a linear warm-up strategy, increasing from 0 to 3 × 10⁻⁴ over the first 1,000 iterations, followed by a linear decay to zero. The batch size is set to 192. The VAE is trained for a total of 50,000 steps on the HPA dataset and fine-tuned for 20,000 steps on the Open Cell dataset. Next, the VAE is fixed and the latent diffusion model is trained. CELL-Diff is pre-trained on the HPA dataset and fine-tuned on the Open Cell dataset. Both pre-training and fine-tuning are conducted for 50,000 iterations using the Adam optimizer (Kingma & Ba, 2014). The learning rate is initialized using a linear warm-up strategy, increasing from 0 to 1 × 10⁻⁴ over the first 1,000 iterations, followed by a linear decay to zero. The batch size is set to 64. The sequence embedding dimension is 1,280, and the bidirectional transformer module consists of 8 layers with 8-head attention. CELL-Diff is trained with 200 diffusion steps using a cosine noise schedule (Peebles & Xie, 2023), and uses DDIM (Song et al., 2020) with 100 steps to accelerate sampling. The weighting coefficient λ in (12) is set to 1, and the maximum protein sequence length is 2,048.
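The OA-ARDM pseudocode quoted in the Pseudocode row can be written as a small runnable sketch. The names `oa_ardm_loss`, `sample_oa_ardm`, and `f_theta` are illustrative, not from the paper; `f_theta` stands in for the trained network and is assumed to return per-position log-probabilities.

```python
import numpy as np

def oa_ardm_loss(f_theta, S, rng):
    """Algorithm 1 (sketch): masked-position log-likelihood.

    f_theta(x) returns per-position log-probabilities of shape (D, K);
    S is an integer sequence of length D. Position value 0 serves as
    the absorbing (masked) state, matching the pseudocode's m * S.
    """
    D = len(S)
    t = rng.integers(1, D + 1)          # t ~ U(1, ..., D)
    sigma = rng.permutation(D) + 1      # sigma ~ U(S_D), ranks 1..D
    m = (sigma < t).astype(int)         # already-revealed positions
    log_probs = f_theta(m * S)          # predict from the masked input
    l = (1 - m) * log_probs[np.arange(D), S]  # score masked positions only
    return l.sum() / (D - t + 1)

def sample_oa_ardm(f_theta, D, K, rng):
    """Algorithm 2 (sketch): reveal one position per step in random order."""
    S = np.zeros(D, dtype=int)
    sigma = rng.permutation(D) + 1
    for t in range(1, D + 1):
        m = (sigma < t).astype(int)     # revealed so far
        n = (sigma == t).astype(int)    # the one position revealed this step
        probs = np.exp(f_theta(m * S))  # categorical C(. | f_theta(m * S))
        S_new = np.array([rng.choice(K, p=p) for p in probs])
        S = (1 - n) * S + n * S_new     # keep old values, fill in one new
    return S
```

A toy `f_theta` returning uniform log-probabilities over K classes is enough to exercise both functions end to end.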
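The dataset split described above (100 proteins held out from the HPA/Open Cell intersection, remainder used for training) amounts to a simple set operation over protein identifiers. The function and argument names below are illustrative, not from the paper.

```python
import random

def make_split(hpa_proteins, opencell_proteins, n_test=100, seed=0):
    """Sketch of the described split: hold out n_test proteins from the
    shared subset as the test set; all remaining proteins train."""
    shared = sorted(set(hpa_proteins) & set(opencell_proteins))
    rng = random.Random(seed)
    test = set(rng.sample(shared, n_test))
    train = (set(hpa_proteins) | set(opencell_proteins)) - test
    return train, test
```

Because the held-out proteins come from the intersection, each test protein has images in both datasets, which is what yields separate HPA and Open Cell test counts (714 and 473 data points).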
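The learning-rate schedule described for both training phases (linear warm-up from 0 over the first 1,000 iterations, then linear decay to zero) can be written as a small helper; the function name and signature are illustrative.

```python
def lr_at(step, total_steps, peak_lr, warmup=1000):
    """Learning rate at a given iteration: linear warm-up from 0 to
    peak_lr over `warmup` steps, then linear decay to 0 at `total_steps`."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

For example, the diffusion pre-training phase described above would correspond to `lr_at(step, 50_000, 1e-4)`, and the VAE phase to `lr_at(step, 50_000, 3e-4)`.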