Scalable Universal T-Cell Receptor Embeddings from Adaptive Immune Repertoires
Authors: Paidamoyo Chapfuwa, Ilker Demirel, Lorenzo Pisani, Javier Zazo, Elon Portugaly, H. Zahid, Julia Greissl
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that TCR embeddings targeting the same pathogen have high cosine similarity, and subject-level embeddings encode both immune genetics and pathogenic exposure history. ... 4 EXPERIMENTS We now assess the performance of TCR and repertoire embeddings on 5 disease prediction and 145 HLA inference binary classification tasks. We benchmark the proposed JL-GLOVE algorithm against competitive baselines. |
| Researcher Affiliation | Collaboration | 1Microsoft Research, Redmond, USA; 2MIT, USA; 3Microsoft Research, Cambridge, UK |
| Pseudocode | Yes | Algorithm 1 Computing TCR co-occurrences C ... Algorithm 2 (Approximately) Computing JL-Norm embeddings W^JL |
| Open Source Code | Yes | PyTorch code to train JL-GLOVE can be found at https://github.com/microsoft/jl-glove. |
| Open Datasets | Yes | We train JL-GLOVE embeddings using two different training cohorts: i) TDETECT cohort of N = 31,938 repertoires (May et al., 2024) and ii) PUBLIC cohort of N = 3,996 repertoires that are publicly available. Both training datasets are unlabeled. To demonstrate the performance on downstream tasks we use two further datasets. The MULTIID dataset is a collection of N = 10,725 repertoires with binary disease and HLA labels. The EMERSON dataset matches that described in Emerson et al. (2017) and has both HLA and CMV labels. |
| Dataset Splits | Yes | We tune the penalty parameter through 5-fold cross-validation. ... Table 1: Summary of the MULTIID and EMERSON repertoire datasets (totals only; per-pathogen columns COVID-19, HSV-1, HSV-2, Parvo, CMV, and Typed HLA omitted): MULTIID Train [Disease/HLA] 6,136; MULTIID Test [Disease/HLA] 4,590; EMERSON Train [Disease] 666; EMERSON Test [Disease] 120; EMERSON Train [HLA] 466; EMERSON Test [HLA] 200 |
| Hardware Specification | Yes | Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000). |
| Software Dependencies | No | We leverage the distributed Dask framework to load minibatches (partitions) of C, i.e., millions of entries in C stored as Parquet files. Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000). ... Although PyTorch Lightning, Dask, and Parquet files are mentioned, no specific version numbers are provided for these software components. |
| Experiment Setup | Yes | We initialize the TCR embeddings W with the JL-Norm embeddings W^JL and use the Adagrad optimizer (Duchi et al., 2011) with learning rate 0.05 to minimize the GloVe objective in Equation (1) via stochastic gradient descent on minibatches from C. ... hyperparameters are set to {s0 = 0.1, s = 0.25, α = 0.5} ... We tune the penalty parameter through 5-fold cross-validation. |
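The pseudocode row mentions Algorithm 2 for "(approximately) computing JL-Norm embeddings W^JL", but the report does not reproduce the algorithm itself. As context, here is a minimal sketch of a plain Johnson-Lindenstrauss random projection of co-occurrence rows; the function name, the Gaussian projection, and the absence of the paper's normalization steps are all assumptions, not the authors' procedure.

```python
import numpy as np

def jl_embeddings(cooc: np.ndarray, d: int, seed: int = 0) -> np.ndarray:
    """Project the rows of a K x K co-occurrence matrix to d dimensions
    with a Gaussian Johnson-Lindenstrauss random projection.

    Hypothetical sketch: the paper's Algorithm 2 may normalize or
    transform C before projecting; those details are not in this summary.
    """
    rng = np.random.default_rng(seed)
    K = cooc.shape[0]
    # Scale by 1/sqrt(d) so pairwise distances are preserved in expectation.
    R = rng.standard_normal((K, d)) / np.sqrt(d)
    return cooc @ R
```

Such a projection is attractive at this scale because it costs one matrix multiply and needs no eigendecomposition of the K x K co-occurrence matrix.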
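The experiment-setup row quotes Adagrad with learning rate 0.05 minimizing the GloVe objective on minibatches of C. A minimal NumPy sketch of one Adagrad step on the standard GloVe weighted least-squares loss follows; note the weighting function with x_max = 100 and alpha = 0.75 is the classic GloVe default, not the paper's {s0, s, α} parameterization, and the single shared embedding matrix (reasonable for a symmetric TCR co-occurrence matrix) is an assumption.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Classic GloVe weighting f(x); caps the influence of very
    # frequent co-occurrences. Not the paper's exact parameterization.
    return np.minimum(x / x_max, 1.0) ** alpha

def adagrad_glove_step(W, B, G_w, G_b, i, j, count, lr=0.05):
    """One Adagrad update on a single co-occurrence entry C[i, j].

    W: K x d embeddings, B: K biases; G_w, G_b accumulate squared
    gradients for Adagrad's per-parameter learning-rate scaling.
    Returns the weighted squared error before the update.
    """
    err = W[i] @ W[j] + B[i] + B[j] - np.log(count)
    f = glove_weight(count)
    g_wi, g_wj = 2 * f * err * W[j], 2 * f * err * W[i]
    g_b = 2 * f * err
    # Accumulate squared gradients, then take scaled descent steps.
    G_w[i] += g_wi ** 2; G_w[j] += g_wj ** 2
    G_b[i] += g_b ** 2;  G_b[j] += g_b ** 2
    W[i] -= lr * g_wi / np.sqrt(G_w[i] + 1e-8)
    W[j] -= lr * g_wj / np.sqrt(G_w[j] + 1e-8)
    B[i] -= lr * g_b / np.sqrt(G_b[i] + 1e-8)
    B[j] -= lr * g_b / np.sqrt(G_b[j] + 1e-8)
    return f * err ** 2
```

In the paper's setting this loop would run over Dask-loaded minibatches of C with torch.optim.Adagrad handling the accumulator bookkeeping; the explicit G_w/G_b arrays here just make Adagrad's mechanics visible.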