Predicting sub-population specific viral evolution

Authors: Wenxian Shi, Menghua Wu, Regina Barzilay

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms state-of-the-art baselines in predicting future distributions of viral proteins across continents and countries. As shown in Fig. 4, our model achieves the best frontier in the average NLL and reverse NLL space for both Flu and Cov, when predicting protein distributions at both the continent and country level. Table 1 reports the coverage (total frequency of occurrence) of the top-100, top-300, and top-500 sequences generated by models for Cov. Our ablation studies dissect the value of incorporating sub-populations as additional signals or through architectural changes (factorizing global distributions into mixtures), the runtime and performance trade-off of hierarchical modeling, and other design choices.
Researcher Affiliation Academia Wenxian Shi, Menghua Wu, and Regina Barzilay, Department of Computer Science, Massachusetts Institute of Technology
Pseudocode No The paper describes its methodology using mathematical equations and textual explanations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes The code is available at https://github.com/wxsh1213/vaxseer/tree/main/transmission.
Open Datasets Yes We obtain the amino acid sequences of these proteins from GISAID (Shu & McCauley, 2017).
Dataset Splits Yes For influenza, we emulate the annual recommendation schedule for the northern hemisphere. Since egg-based vaccines require lead times of up to 6 months, we train our models on data collected before February of each year, and evaluate models on the sequences collected from October to March of the next year (winter season), following Shi et al. (2023). ... Specifically, we trained four models using data collected before four end-points: 2021-07, 2021-10, 2022-01, and 2022-04. For instance, a model trained on sequences collected before 2021-07-01 will be evaluated on sequences collected between 2021-10-01 and 2022-01-01.
Hardware Specification Yes We trained our model on a 48 GB NVIDIA RTX A6000 GPU.
Software Dependencies No The paper mentions 'GPT-2 (Radford et al., 2019)' and 'Adam optimizer' but does not provide specific version numbers for these or other software libraries like PyTorch, TensorFlow, or Python itself. While cuSOLVER (NVIDIA Corporation, 2023) is cited with a version, it is not explicitly stated as a direct software dependency that users would install to replicate their codebase.
Experiment Setup Yes For continent-level transmission models, we use a 6-layer GPT-2 (Radford et al., 2019) to parameterize the transmission rate matrix Aθ and another 6-layer GPT-2 to model the initial occurrence N0(x; θ). ... The Adam optimizer with learning rates of 1e-5 (Flu) and 5e-5 (Cov) is used, and the models are trained for 80,000 steps for Flu and 30,000 for Cov with batch sizes of 32 and 256, respectively. The learning rate is linearly warmed up from 0 to the specified value over the first 10% of training and then decays linearly to zero. ... We set the λ for the group regression loss Lgroup to 0.1. ... While the transmission rate matrix is not necessarily symmetric, assuming it is a real symmetric matrix is beneficial for training stability and acceleration. Thus, in practice, we parameterize the transmission rate matrix Aθ(x) as a positive and real symmetric matrix.
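The symmetric parameterization described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the paper only states that Aθ(x) is parameterized as a positive, real symmetric matrix, so the symmetrize-then-softplus map and the function name `transmission_rate` below are our assumptions.

```python
import numpy as np

def transmission_rate(raw):
    """Map an unconstrained square matrix (e.g. the output of a network
    head) to a positive, real symmetric transmission rate matrix.
    Symmetrization guarantees real eigenvalues; an elementwise softplus
    (our choice, not stated in the paper) enforces positivity."""
    sym = 0.5 * (raw + raw.T)          # real symmetric
    return np.log1p(np.exp(sym))       # softplus: strictly positive

rng = np.random.default_rng(0)
raw = rng.standard_normal((5, 5))      # unconstrained parameters
A = transmission_rate(raw)
# A is symmetric with all entries > 0.
```

A real symmetric matrix has real eigenvalues and an orthogonal eigenbasis, which is consistent with the paper's remark that this assumption aids training stability and acceleration.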
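The learning-rate schedule (linear warmup over the first 10% of training, then linear decay to zero) can be written as a small helper. The peak learning rates and step counts are the paper's; the helper itself, including the name `lr_at_step`, is a sketch under our assumptions.

```python
def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warmup from 0 to peak_lr over the first warmup_frac of
    training, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Flu setting from the paper: peak lr 1e-5, 80,000 training steps.
start = lr_at_step(0, 80_000, 1e-5)        # 0.0 at step 0
peak = lr_at_step(8_000, 80_000, 1e-5)     # 1e-5 at end of warmup
end = lr_at_step(80_000, 80_000, 1e-5)     # 0.0 at the final step
```

In practice this would be attached to the optimizer via a per-step scheduler (e.g. a lambda-based scheduler in the training framework of choice).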
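The temporal splitting protocol in the Dataset Splits row (train on sequences collected before a cutoff, evaluate on a later window, with a gap emulating the vaccine lead time) can be sketched as a date filter. The record format and the function name `split_by_date` are hypothetical; the cutoff and evaluation dates come from the paper's SARS-CoV-2 example.

```python
from datetime import date

def split_by_date(records, train_cutoff, eval_start, eval_end):
    """Temporal split over (sequence, collection_date) pairs:
    train on data collected before `train_cutoff`, evaluate on data
    collected in [eval_start, eval_end). Dates in between fall into
    the lead-time gap and are used by neither split."""
    train = [(s, d) for s, d in records if d < train_cutoff]
    evaluation = [(s, d) for s, d in records if eval_start <= d < eval_end]
    return train, evaluation

# One of the four SARS-CoV-2 splits described above: train before
# 2021-07-01, evaluate on 2021-10-01 .. 2022-01-01.
records = [
    ("seqA", date(2021, 5, 3)),    # training data
    ("seqB", date(2021, 8, 20)),   # in the lead-time gap: unused
    ("seqC", date(2021, 11, 15)),  # evaluation data
]
train, evaluation = split_by_date(
    records, date(2021, 7, 1), date(2021, 10, 1), date(2022, 1, 1)
)
```

The same pattern covers the influenza schedule, with training data cut off in February of each year and evaluation on the following October-to-March winter season.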