Scalable Universal T-Cell Receptor Embeddings from Adaptive Immune Repertoires

Authors: Paidamoyo Chapfuwa, Ilker Demirel, Lorenzo Pisani, Javier Zazo, Elon Portugaly, H. Zahid, Julia Greissl

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that TCR embeddings targeting the same pathogen have high cosine similarity, and subject-level embeddings encode both immune genetics and pathogenic exposure history. ... 4 EXPERIMENTS: We now assess the performance of TCR and repertoire embeddings on 5 disease prediction and 145 HLA inference binary classification tasks. We benchmark the proposed JL-GLOVE algorithm against competitive baselines.
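The quoted result compares TCR embeddings by cosine similarity. As a minimal sketch of that measure (toy vectors only, not the paper's embeddings or data):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # 1.0 means identical direction, 0.0 means orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings": a and b point the same way, c does not.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
c = np.array([-3.0, 0.0, 1.0])

print(cosine_similarity(a, b))  # parallel vectors -> 1.0
print(cosine_similarity(a, c))  # orthogonal here -> 0.0
```

In the paper's setting, a high cosine similarity between two TCR embedding vectors is read as evidence that the receptors target the same pathogen.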
Researcher Affiliation | Collaboration | Microsoft Research, Redmond, USA; MIT, USA; Microsoft Research, Cambridge, UK
Pseudocode | Yes | Algorithm 1: Computing TCR co-occurrences C ... Algorithm 2: (Approximately) computing JL-Norm embeddings W^JL
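The row above names the paper's two algorithms. The sketch below illustrates the general ideas behind them under simplifying assumptions: counting TCR pairs that appear in the same repertoire, then compressing the co-occurrence matrix with a Johnson-Lindenstrauss-style random projection. It is not the paper's exact Algorithm 1 or 2 (the real co-occurrence and JL-Norm constructions are more involved), and all function names and toy data are illustrative.

```python
import itertools
from collections import Counter
import numpy as np

def tcr_cooccurrences(repertoires):
    # Count how often each pair of TCRs appears in the same repertoire
    # (a simplified stand-in for the co-occurrence statistic C).
    C = Counter()
    for rep in repertoires:
        for i, j in itertools.combinations(sorted(set(rep)), 2):
            C[(i, j)] += 1
    return C

def jl_embeddings(cooc, tcrs, dim, seed=0):
    # Project rows of the (densified) co-occurrence matrix to `dim`
    # dimensions with a Gaussian random matrix -- the JL idea, not the
    # paper's exact JL-Norm embedding.
    rng = np.random.default_rng(seed)
    index = {t: k for k, t in enumerate(tcrs)}
    M = np.zeros((len(tcrs), len(tcrs)))
    for (i, j), c in cooc.items():
        M[index[i], index[j]] = M[index[j], index[i]] = c
    R = rng.standard_normal((len(tcrs), dim)) / np.sqrt(dim)
    return M @ R

reps = [["t1", "t2", "t3"], ["t1", "t2"], ["t3", "t4"]]
C = tcr_cooccurrences(reps)
print(C[("t1", "t2")])  # t1 and t2 share two repertoires -> 2
W = jl_embeddings(C, ["t1", "t2", "t3", "t4"], dim=2)
print(W.shape)          # one 2-d embedding per TCR -> (4, 2)
```

The random projection approximately preserves pairwise distances, which is why it can serve as a cheap initialization for the subsequent GloVe-style optimization.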
Open Source Code | Yes | PyTorch code to train JL-GLOVE can be found at https://github.com/microsoft/jl-glove.
Open Datasets | Yes | We train JL-GLOVE embeddings using two different training cohorts: i) the TDETECT cohort of N = 31,938 repertoires (May et al., 2024) and ii) the PUBLIC cohort of N = 3,996 repertoires that are publicly available. Both training datasets are unlabeled. To demonstrate the performance on downstream tasks we use two further datasets. The MULTIID dataset is a collection of N = 10,725 repertoires with binary disease and HLA labels. The EMERSON dataset matches that described in Emerson et al. (2017) and has both HLA and CMV labels.
Dataset Splits | Yes | We tune the penalty parameter through 5-fold cross-validation. ... Table 1: Summary of the MULTIID and EMERSON repertoire datasets. Columns: Total, COVID-19, HSV-1, HSV-2, Parvo, CMV, Typed HLA. Rows: MULTIID Train [Disease/HLA] 6,136 ...; MULTIID Test [Disease/HLA] 4,590 ...; EMERSON Train [Disease] 666 ...; EMERSON Test [Disease] 120 ...; EMERSON Train [HLA] 466 ...; EMERSON Test [HLA] 200
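The penalty parameter is tuned via 5-fold cross-validation. A minimal sketch of the splitting step (pure index bookkeeping, independent of the paper's models):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    # Shuffle sample indices and split into k roughly equal folds; each
    # fold serves once as the validation set while the rest trains,
    # as in standard 5-fold cross-validation.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=5))
print(len(splits))        # 5 train/validation splits
print(len(splits[0][1]))  # 20 validation samples per fold for n=100
```

Each candidate penalty value would be scored by averaging validation performance over the five folds and the best value retrained on the full training set.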
Hardware Specification | Yes | Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000).
Software Dependencies | No | We leverage the distributed Dask framework to load minibatches (partitions) of C, i.e., millions of entries in C stored as Parquet files. Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000). ... Although PyTorch Lightning, Dask, and Parquet files are mentioned, no specific version numbers are provided for these software components.
Experiment Setup | Yes | We initialize the TCR embeddings W with the JL-Norm embeddings W^JL and use the Adagrad optimizer (Duchi et al., 2011) with learning rate 0.05 to minimize the GloVe objective in Equation (1) via stochastic gradient descent on minibatches from C. ... hyperparameters are set to {s0 = 0.1, s = 0.25, α = 0.5} ... We tune the penalty parameter through 5-fold cross-validation.
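The setup minimizes a GloVe objective with Adagrad at learning rate 0.05. Below is a self-contained numpy sketch of that combination using the classic GloVe weighting f(x) = min((x/x_max)^α, 1) with its standard defaults (x_max = 100, α = 0.75); these defaults, the toy co-occurrence table, and the function name are assumptions for illustration, not the paper's {s0, s, α} parameterization or its PyTorch implementation.

```python
import numpy as np

def glove_pass(W, Wc, b, bc, cooc, lr=0.05, x_max=100.0, alpha=0.75, state=None):
    # One Adagrad sweep over co-occurrence entries, minimizing
    #   sum_ij f(C_ij) * (w_i . w~_j + b_i + b~_j - log C_ij)^2
    # Assumed weighting f is the classic GloVe default, not the paper's.
    if state is None:  # per-parameter accumulators of squared gradients
        state = {"W": np.full_like(W, 1e-8), "Wc": np.full_like(Wc, 1e-8),
                 "b": np.full_like(b, 1e-8), "bc": np.full_like(bc, 1e-8)}
    total = 0.0
    for (i, j), x in cooc.items():
        f = min((x / x_max) ** alpha, 1.0)
        err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(x)
        total += f * err * err
        gWi, gWj = 2 * f * err * Wc[j], 2 * f * err * W[i]
        gb = 2 * f * err
        # Adagrad: divide each step by the root of accumulated squared grads.
        state["W"][i] += gWi ** 2
        W[i] -= lr * gWi / np.sqrt(state["W"][i])
        state["Wc"][j] += gWj ** 2
        Wc[j] -= lr * gWj / np.sqrt(state["Wc"][j])
        state["b"][i] += gb ** 2
        b[i] -= lr * gb / np.sqrt(state["b"][i])
        state["bc"][j] += gb ** 2
        bc[j] -= lr * gb / np.sqrt(state["bc"][j])
    return total, state

rng = np.random.default_rng(0)
K, d = 4, 8
W = rng.normal(scale=0.1, size=(K, d))
Wc = rng.normal(scale=0.1, size=(K, d))
b, bc = np.zeros(K), np.zeros(K)
cooc = {(0, 1): 5.0, (1, 2): 3.0, (2, 3): 8.0}  # toy co-occurrence counts

loss0, state = glove_pass(W, Wc, b, bc, cooc)
loss1, _ = glove_pass(W, Wc, b, bc, cooc, state=state)
print(loss1 < loss0)  # the weighted squared error shrinks across passes
```

In the paper, W would instead be initialized from the JL-Norm embeddings W^JL and the sweep would run over Dask-loaded minibatches of C rather than a Python dict.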