Scalable Universal T-Cell Receptor Embeddings from Adaptive Immune Repertoires

Authors: Paidamoyo Chapfuwa, Ilker Demirel, Lorenzo Pisani, Javier Zazo, Elon Portugaly, H. Zahid, Julia Greissl

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that TCR embeddings targeting the same pathogen have high cosine similarity, and subject-level embeddings encode both immune genetics and pathogenic exposure history. ... 4 EXPERIMENTS: We now assess the performance of TCR and repertoire embeddings on 5 disease prediction and 145 HLA inference binary classification tasks. We benchmark the proposed JL-GLOVE algorithm against competitive baselines.
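The quoted result compares TCR embeddings by cosine similarity. As a minimal sketch of that measure (toy vectors only, not the paper's embeddings or data):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # 1.0 means identical direction, 0.0 means orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings": a and b point the same way, c does not.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
c = np.array([-3.0, 0.0, 1.0])

print(cosine_similarity(a, b))  # parallel vectors -> 1.0
print(cosine_similarity(a, c))  # orthogonal here -> 0.0
```

In the paper's setting, a high cosine similarity between two TCR embedding vectors is read as evidence that the receptors target the same pathogen.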
Researcher Affiliation | Collaboration | Microsoft Research, Redmond, USA; MIT, USA; Microsoft Research, Cambridge, UK
Pseudocode | Yes | Algorithm 1: Computing TCR co-occurrences C ... Algorithm 2: (Approximately) computing JL-Norm embeddings W^JL
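The row above names the paper's two algorithms. The sketch below illustrates the general ideas behind them under simplifying assumptions: counting TCR pairs that appear in the same repertoire, then compressing the co-occurrence matrix with a Johnson-Lindenstrauss-style random projection. It is not the paper's exact Algorithm 1 or 2 (the real co-occurrence and JL-Norm constructions are more involved), and all function names and toy data are illustrative.

```python
import itertools
from collections import Counter
import numpy as np

def tcr_cooccurrences(repertoires):
    # Count how often each pair of TCRs appears in the same repertoire
    # (a simplified stand-in for the co-occurrence statistic C).
    C = Counter()
    for rep in repertoires:
        for i, j in itertools.combinations(sorted(set(rep)), 2):
            C[(i, j)] += 1
    return C

def jl_embeddings(cooc, tcrs, dim, seed=0):
    # Project rows of the (densified) co-occurrence matrix to `dim`
    # dimensions with a Gaussian random matrix -- the JL idea, not the
    # paper's exact JL-Norm embedding.
    rng = np.random.default_rng(seed)
    index = {t: k for k, t in enumerate(tcrs)}
    M = np.zeros((len(tcrs), len(tcrs)))
    for (i, j), c in cooc.items():
        M[index[i], index[j]] = M[index[j], index[i]] = c
    R = rng.standard_normal((len(tcrs), dim)) / np.sqrt(dim)
    return M @ R

reps = [["t1", "t2", "t3"], ["t1", "t2"], ["t3", "t4"]]
C = tcr_cooccurrences(reps)
print(C[("t1", "t2")])  # t1 and t2 share two repertoires -> 2
W = jl_embeddings(C, ["t1", "t2", "t3", "t4"], dim=2)
print(W.shape)          # one 2-d embedding per TCR -> (4, 2)
```

The random projection approximately preserves pairwise distances, which is why it can serve as a cheap initialization for the subsequent GloVe-style optimization.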
Open Source Code | Yes | PyTorch code to train JL-GLOVE can be found at https://github.com/microsoft/jl-glove.
Open Datasets | Yes | We train JL-GLOVE embeddings using two different training cohorts: i) the TDETECT cohort of N = 31,938 repertoires (May et al., 2024) and ii) the PUBLIC cohort of N = 3,996 repertoires that are publicly available. Both training datasets are unlabeled. To demonstrate the performance on downstream tasks we use two further datasets. The MULTIID dataset is a collection of N = 10,725 repertoires with binary disease and HLA labels. The EMERSON dataset matches that described in Emerson et al. (2017) and has both HLA and CMV labels.
Dataset Splits | Yes | We tune the penalty parameter through 5-fold cross-validation. ... Table 1: Summary of the MULTIID and EMERSON repertoire datasets. Columns: Total, COVID-19, HSV-1, HSV-2, Parvo, CMV, Typed HLA. Rows: MULTIID Train [Disease/HLA] 6,136 ...; MULTIID Test [Disease/HLA] 4,590 ...; EMERSON Train [Disease] 666 ...; EMERSON Test [Disease] 120 ...; EMERSON Train [HLA] 466 ...; EMERSON Test [HLA] 200
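The penalty parameter is tuned via 5-fold cross-validation. A minimal sketch of the splitting step (pure index bookkeeping, independent of the paper's models):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    # Shuffle sample indices and split into k roughly equal folds; each
    # fold serves once as the validation set while the rest trains,
    # as in standard 5-fold cross-validation.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=5))
print(len(splits))        # 5 train/validation splits
print(len(splits[0][1]))  # 20 validation samples per fold for n=100
```

Each candidate penalty value would be scored by averaging validation performance over the five folds and the best value retrained on the full training set.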
Hardware Specification | Yes | Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000).
Software Dependencies | No | We leverage the distributed Dask framework to load minibatches (partitions) of C, i.e., millions of entries in C stored as Parquet files. Further, we use PyTorch Lightning for distributed data parallel training across one node equipped with 4 NVIDIA A100 80GB GPUs (if K ≥ 500,000) or 2 Tesla V100 16GB GPUs (if K < 500,000). ... Although PyTorch Lightning, Dask, and Parquet files are mentioned, no specific version numbers are provided for these software components.
Experiment Setup | Yes | We initialize the TCR embeddings W with the JL-Norm embeddings W^JL and use the Adagrad optimizer (Duchi et al., 2011) with learning rate 0.05 to minimize the GloVe objective in Equation (1) via stochastic gradient descent on minibatches from C. ... hyperparameters are set to {s0 = 0.1, s = 0.25, α = 0.5} ... We tune the penalty parameter through 5-fold cross-validation.
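The setup minimizes a GloVe objective with Adagrad at learning rate 0.05. Below is a self-contained numpy sketch of that combination using the classic GloVe weighting f(x) = min((x/x_max)^α, 1) with its standard defaults (x_max = 100, α = 0.75); these defaults, the toy co-occurrence table, and the function name are assumptions for illustration, not the paper's {s0, s, α} parameterization or its PyTorch implementation.

```python
import numpy as np

def glove_pass(W, Wc, b, bc, cooc, lr=0.05, x_max=100.0, alpha=0.75, state=None):
    # One Adagrad sweep over co-occurrence entries, minimizing
    #   sum_ij f(C_ij) * (w_i . w~_j + b_i + b~_j - log C_ij)^2
    # Assumed weighting f is the classic GloVe default, not the paper's.
    if state is None:  # per-parameter accumulators of squared gradients
        state = {"W": np.full_like(W, 1e-8), "Wc": np.full_like(Wc, 1e-8),
                 "b": np.full_like(b, 1e-8), "bc": np.full_like(bc, 1e-8)}
    total = 0.0
    for (i, j), x in cooc.items():
        f = min((x / x_max) ** alpha, 1.0)
        err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(x)
        total += f * err * err
        gWi, gWj = 2 * f * err * Wc[j], 2 * f * err * W[i]
        gb = 2 * f * err
        # Adagrad: divide each step by the root of accumulated squared grads.
        state["W"][i] += gWi ** 2
        W[i] -= lr * gWi / np.sqrt(state["W"][i])
        state["Wc"][j] += gWj ** 2
        Wc[j] -= lr * gWj / np.sqrt(state["Wc"][j])
        state["b"][i] += gb ** 2
        b[i] -= lr * gb / np.sqrt(state["b"][i])
        state["bc"][j] += gb ** 2
        bc[j] -= lr * gb / np.sqrt(state["bc"][j])
    return total, state

rng = np.random.default_rng(0)
K, d = 4, 8
W = rng.normal(scale=0.1, size=(K, d))
Wc = rng.normal(scale=0.1, size=(K, d))
b, bc = np.zeros(K), np.zeros(K)
cooc = {(0, 1): 5.0, (1, 2): 3.0, (2, 3): 8.0}  # toy co-occurrence counts

loss0, state = glove_pass(W, Wc, b, bc, cooc)
loss1, _ = glove_pass(W, Wc, b, bc, cooc, state=state)
print(loss1 < loss0)  # the weighted squared error shrinks across passes
```

In the paper, W would instead be initialized from the JL-Norm embeddings W^JL and the sweep would run over Dask-loaded minibatches of C rather than a Python dict.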