Predicting sub-population specific viral evolution

Authors: Wenxian Shi, Menghua Wu, Regina Barzilay

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms state-of-the-art baselines in predicting future distributions of viral proteins across continents and countries. As shown in Fig. 4, our model achieves the best frontier in the average NLL and reverse NLL space for both Flu and Cov, when predicting protein distributions at both the continent and country level. Table 1 reports the coverage (total frequency of occurrence) of the top-100, top-300, and top-500 sequences generated by models for Cov. Our ablation studies dissect the value of incorporating sub-populations as additional signals or through architectural changes (factorizing global distributions into mixtures), the runtime and performance trade-off of hierarchical modeling, and other design choices.
Researcher Affiliation Academia Wenxian Shi, Menghua Wu, and Regina Barzilay, Department of Computer Science, Massachusetts Institute of Technology
Pseudocode No The paper describes its methodology using mathematical equations and textual explanations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes The code is available at https://github.com/wxsh1213/vaxseer/tree/main/transmission.
Open Datasets Yes We obtain the amino acid sequences of these proteins from GISAID (Shu & McCauley, 2017).
Dataset Splits Yes For influenza, we emulate the annual recommendation schedule for the northern hemisphere. Since egg-based vaccines require lead times of up to 6 months, we train our models on data collected before February of each year, and evaluate models on the sequences collected from October to March of the next year (winter season), following Shi et al. (2023). ... Specifically, we trained four models using data collected before four end-points: 2021-07, 2021-10, 2022-01, and 2022-04. For instance, a model trained on sequences collected before 2021-07-01 will be evaluated on sequences collected between 2021-10-01 and 2022-01-01.
Hardware Specification Yes We trained our model on a 48 GB NVIDIA RTX A6000 GPU.
Software Dependencies No The paper mentions 'GPT-2 (Radford et al., 2019)' and 'Adam optimizer' but does not provide specific version numbers for these or other software libraries like PyTorch, TensorFlow, or Python itself. While cuSOLVER (NVIDIA Corporation, 2023) is cited with a version, it is not explicitly stated as a direct software dependency that users would install to replicate their codebase.
Experiment Setup Yes For continent-level transmission models, we use a 6-layer GPT-2 (Radford et al., 2019) to parameterize the transmission rate matrix Aθ and another 6-layer GPT-2 to model the initial occurrence N0(x; θ). ... The Adam optimizer with learning rates of 1e-5 (Flu) and 5e-5 (Cov) is used, and the models are trained for 80,000 steps for Flu and 30,000 for Cov with batch sizes of 32 and 256, respectively. The learning rate is linearly warmed up from 0 to the specified value over the first 10% of training and then decays linearly to zero. ... We set the λ for the group regression loss Lgroup to 0.1. ... While the transmission rate matrix is not necessarily symmetric, assuming it is a real symmetric matrix is beneficial for training stability and acceleration. Thus, in practice, we parameterize the transmission rate matrix Aθ(x) as a positive and real symmetric matrix.
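The symmetric parameterization described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the paper only states that Aθ(x) is parameterized as a positive, real symmetric matrix, so the symmetrize-then-softplus map and the function name `transmission_rate` below are our assumptions.

```python
import numpy as np

def transmission_rate(raw):
    """Map an unconstrained square matrix (e.g. the output of a network
    head) to a positive, real symmetric transmission rate matrix.
    Symmetrization guarantees real eigenvalues; an elementwise softplus
    (our choice, not stated in the paper) enforces positivity."""
    sym = 0.5 * (raw + raw.T)          # real symmetric
    return np.log1p(np.exp(sym))       # softplus: strictly positive

rng = np.random.default_rng(0)
raw = rng.standard_normal((5, 5))      # unconstrained parameters
A = transmission_rate(raw)
# A is symmetric with all entries > 0.
```

A real symmetric matrix has real eigenvalues and an orthogonal eigenbasis, which is consistent with the paper's remark that this assumption aids training stability and acceleration.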
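The learning-rate schedule (linear warmup over the first 10% of training, then linear decay to zero) can be written as a small helper. The peak learning rates and step counts are the paper's; the helper itself, including the name `lr_at_step`, is a sketch under our assumptions.

```python
def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warmup from 0 to peak_lr over the first warmup_frac of
    training, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Flu setting from the paper: peak lr 1e-5, 80,000 training steps.
start = lr_at_step(0, 80_000, 1e-5)        # 0.0 at step 0
peak = lr_at_step(8_000, 80_000, 1e-5)     # 1e-5 at end of warmup
end = lr_at_step(80_000, 80_000, 1e-5)     # 0.0 at the final step
```

In practice this would be attached to the optimizer via a per-step scheduler (e.g. a lambda-based scheduler in the training framework of choice).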
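The temporal splitting protocol in the Dataset Splits row (train on sequences collected before a cutoff, evaluate on a later window, with a gap emulating the vaccine lead time) can be sketched as a date filter. The record format and the function name `split_by_date` are hypothetical; the cutoff and evaluation dates come from the paper's SARS-CoV-2 example.

```python
from datetime import date

def split_by_date(records, train_cutoff, eval_start, eval_end):
    """Temporal split over (sequence, collection_date) pairs:
    train on data collected before `train_cutoff`, evaluate on data
    collected in [eval_start, eval_end). Dates in between fall into
    the lead-time gap and are used by neither split."""
    train = [(s, d) for s, d in records if d < train_cutoff]
    evaluation = [(s, d) for s, d in records if eval_start <= d < eval_end]
    return train, evaluation

# One of the four SARS-CoV-2 splits described above: train before
# 2021-07-01, evaluate on 2021-10-01 .. 2022-01-01.
records = [
    ("seqA", date(2021, 5, 3)),    # training data
    ("seqB", date(2021, 8, 20)),   # in the lead-time gap: unused
    ("seqC", date(2021, 11, 15)),  # evaluation data
]
train, evaluation = split_by_date(
    records, date(2021, 7, 1), date(2021, 10, 1), date(2022, 1, 1)
)
```

The same pattern covers the influenza schedule, with training data cut off in February of each year and evaluation on the following October-to-March winter season.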