MAESTRO: Masked Encoding Set Transformer with Self-Distillation
Authors: Matthew Lee, Jaesik Kim, Matei Ionita, Jonghyun Lee, Michelle McKeague, Yonghyun Nam, Irene Khavin, Yidi Huang, Victoria Fang, Sokratis Apostolidis, Divij Mathew, Shwetank, Ajinkya Pattekar, Zahabia Rangwala, Amit Bar-Or, Benjamin Fensterheim, Benjamin Abramoff, Rennie Rhee, Damian Maseda, Allison Greenplate, John Wherry, Dokyoon Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of MAESTRO in representing and analyzing cytometry set data, we conducted experiments on a large cohort of cytometry samples. We benchmark our model against existing cytometry approaches and other machine learning methods not previously applied to cytometry. Our model outperforms existing approaches in retrieving cell-type distributions and capturing clinically relevant features for downstream tasks such as disease diagnosis, age, and sex prediction. To demonstrate MAESTRO's capability in accurately reconstructing masked cells, one of the objectives during training, Figure 2 showcases the reconstruction results for eight randomly selected samples from a held-out test set during pre-training. We evaluated the quality of MAESTRO's latent representations by projecting the 1,024-dimensional embeddings into two dimensions using UMAP for visualization, colored by each sample's diagnosis (Figure 3). To evaluate the discriminative power of the learned representations, we performed a linear probing task where we used the latent representations as inputs to a basic regression model for predicting diagnosis, age, and sex. Lastly, we perform an ablation study on the same task using our MAESTRO architecture, indicating the importance of various design choices (Table 1). |
| Researcher Affiliation | Academia | (1) Institute for Immune Health and Immunology; (2) Department of Biostatistics, Epidemiology and Informatics; (3) Department of Systems Pharmacology & Translational Therapeutics; University of Pennsylvania, Philadelphia, PA 19104, USA. {matthew.lee1}@pennmedicine.upenn.edu |
| Pseudocode | Yes | Algorithm 1 Non-Random Block Masking (NRBM). Input: Set S = {x_1, ..., x_n} ⊂ ℝ^d, mask ratio ρ ∈ [0, 1]. Output: Masked set S′, mask vector M. Algorithm 2 Sinkhorn Optimal Transport Distance. Input: Sets S, Ŝ, iterations T. Output: Sinkhorn distance d. Algorithm 3 MAESTRO Model Overview. Input: Input set S, mask ratio ρ, EMA decay rate α, temperature τ. Output: Reconstructed set Ŝ. |
| Open Source Code | Yes | Code available: https://github.com/matthewlee1/MAESTRO |
| Open Datasets | No | We utilized a dataset of 1,514 whole blood cytometry samples spanning 14 cohorts and 11 phenotypes (Appendix E.1). Data were generated at three locations over various time points (Appendix E.2), with raw data showing batch effects (Appendix E.3.1). We employed the technical control sample Batch Control HD2, which exhibits minimal batch effects in learned representations (Figure 3) compared to raw data (Appendix E.3.2). Disease diagnoses and metadata were provided by the primary clinician teams for each study. Appendix E.1 displays the distribution of cell counts per sample (minimum=11,829; maximum=1,386,520), highlighting dataset variability. Each sample is represented as a matrix with cells as rows and proteins as columns. Notably, cell types obtained through manual gating are used only to evaluate the representations learned by MAESTRO. The paper describes the dataset and its characteristics in detail, but it does not provide any concrete access information (link, DOI, repository, or citation to a public dataset) for this specific dataset. It appears to be a newly generated dataset by the authors. |
| Dataset Splits | Yes | Using 1,514 whole blood cytometry samples, we create a train and test set using an 80/20 random split within each diagnosis to ensure that all diagnoses are represented in the test set. This test dataset is never used for training during pre-training, linear probing, ablation, or cell-type proportion retrieval. |
| Hardware Specification | Yes | We pre-train MAESTRO on four NVIDIA A100 GPUs for 156 hours at 29 minutes per epoch (324 epochs). Memory usage varies as each sample is of variable size, but on average the model consumes 58 GB per GPU. |
| Software Dependencies | No | The logistic regression model is implemented using scikit-learn's LogisticRegression(). We train on our dataset with the AdamW optimizer and a batch size of 1. The paper mentions software such as scikit-learn and the AdamW optimizer, but it does not provide version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | The learning rate starts at 1e-4 and is decayed with a cosine annealing scheduler to a minimum of 1e-8. We use NRBM to mask four different copies of our samples at masking rates of 20%, 40%, 60%, and 80%. Masking is done on the fly during pre-training, not beforehand. We mask by replacing selected indices with a learnable mask token. MAESTRO is configured with four attention heads, a hidden size of 2,048, and a latent dimension of 1,024. Our input cells start at 30 dimensions (representing 30 different protein markers), which we transform to 1,024 dimensions with a linear layer before applying three ISAB blocks (with 2,500 learned points per ISAB), followed by PMA. After pooling to a single learnable seed (vector), we decode the vector by copying the pooled input to each index of the unmasked cells and use the masking token to denote indices where the original cell was masked. We follow this with a PMA block whose number of seeds equals the size of the input matrix, and then three SAB blocks. Finally, we use a linear layer to transform our output back to its original 30 dimensions. All attention blocks use a SwiGLU activation function to learn both linear and non-linear relationships (Shazeer, 2020). Reconstruction is calculated using Sinkhorn Optimal Transport with Euclidean distance for the cost matrix. We align latent representations from the student and teacher models by using a non-linear projection head on the latent representation. Following the projection head, we use a softmax activation to convert our representations to probability distributions and minimize the difference between them with Kullback-Leibler (KL) divergence. The teacher model is updated with the exponential moving average of the student at a momentum value of 0.999. Student and teacher temperatures are set to 0.1 and 0.07, respectively. |
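The Sinkhorn Optimal Transport reconstruction loss (Algorithm 2 in the table) can be sketched in a few lines of NumPy. The entropic regularization strength `eps` and the uniform marginals are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

def sinkhorn_distance(S, S_hat, eps=0.1, T=100):
    """Entropic-regularized OT distance between two point sets.

    S: (n, d) array, S_hat: (m, d) array. The cost matrix uses
    Euclidean distance, as described in the experiment setup.
    """
    n, m = len(S), len(S_hat)
    # Pairwise Euclidean cost matrix C[i, j] = ||S_i - S_hat_j||
    C = np.linalg.norm(S[:, None, :] - S_hat[None, :, :], axis=-1)
    K = np.exp(-C / eps)                              # Gibbs kernel
    a = np.full(n, 1.0 / n)                           # uniform source marginal
    b = np.full(m, 1.0 / m)                           # uniform target marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(T):                                # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                   # transport plan
    return float((P * C).sum())                       # <P, C>
```

In practice a log-domain implementation (or a library such as POT or GeomLoss) is preferable for numerical stability at small `eps`.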
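The masking step described in the setup (replace a fraction ρ of cells with a learnable mask token, on the fly) could be sketched as below. The exact block-selection rule of NRBM is not spelled out in this table, so masking one contiguous block at a random start index is an assumption for illustration:

```python
import numpy as np

def block_mask(S, rho, mask_token, rng=None):
    """Mask a contiguous block of round(rho * n) cells (rows) of set S.

    S: (n, d) array of cells; mask_token: (d,) vector standing in for
    the learnable mask token. Returns the masked set and boolean mask M.
    """
    rng = rng or np.random.default_rng(0)
    n = len(S)
    k = int(round(rho * n))                  # number of cells to mask
    start = rng.integers(0, n - k + 1)       # random block start (assumption)
    M = np.zeros(n, dtype=bool)
    M[start:start + k] = True
    S_masked = S.copy()
    S_masked[M] = mask_token                 # replace masked indices with the token
    return S_masked, M
```

During pre-training this would be applied to four copies of each sample with ρ ∈ {0.2, 0.4, 0.6, 0.8}, as stated in the experiment setup.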
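The dataset-splits row describes an 80/20 random split within each diagnosis so that every diagnosis appears in the test set. A minimal stdlib sketch, with hypothetical `sample_ids` and `diagnoses` inputs:

```python
import random
from collections import defaultdict

def stratified_split(sample_ids, diagnoses, test_frac=0.2, seed=0):
    """80/20 random split performed independently within each diagnosis."""
    rng = random.Random(seed)
    by_dx = defaultdict(list)
    for sid, dx in zip(sample_ids, diagnoses):
        by_dx[dx].append(sid)
    train, test = [], []
    for ids in by_dx.values():
        rng.shuffle(ids)
        # At least one sample per diagnosis goes to test,
        # ensuring all diagnoses are represented there.
        n_test = max(1, round(test_frac * len(ids)))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

scikit-learn's `train_test_split(..., stratify=diagnoses)` achieves the same effect in one call.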
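The self-distillation step in the setup (EMA teacher update at momentum 0.999, temperature-scaled softmax, KL divergence between student and teacher distributions) can be illustrated with plain NumPy. The projection head is omitted and the logits are hypothetical; only the momentum and temperature values come from the table:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.999):
    """teacher <- alpha * teacher + (1 - alpha) * student, per parameter."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def softmax(z, tau):
    """Temperature-scaled softmax along the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(student_logits, teacher_logits, tau_s=0.1, tau_t=0.07):
    """KL(teacher || student) with the paper's temperatures (0.1 / 0.07)."""
    p_t = softmax(teacher_logits, tau_t)
    p_s = softmax(student_logits, tau_s)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

The lower teacher temperature sharpens the teacher's distribution relative to the student's, as in DINO-style self-distillation.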