Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Variational Approach to Bayesian Phylogenetic Inference
Authors: Cheng Zhang, Frederick A. Matsen IV
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods. |
| Researcher Affiliation | Academia | Cheng Zhang (EMAIL), School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, 100871, China; Frederick A. Matsen IV (EMAIL), Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Genome Sciences and Department of Statistics, University of Washington, Seattle, WA 98195, USA |
| Pseudocode | Yes | Algorithm 1: The variational Bayesian phylogenetic inference (VBPI) algorithm. 1: φ, ψ ← initialize parameters. 2: while not converged do 3: τ^1, …, τ^K ← random samples from the current approximating tree distribution Q_φ(τ). 4: ϵ^1, …, ϵ^K ← random samples from the multivariate standard normal distribution N(0, I). 5: g ← ∇_{φ,ψ} L^K(φ, ψ; τ^{1:K}, ϵ^{1:K}) (use any gradient estimator from Section 4.4). 6: φ, ψ ← update parameters using gradients g (e.g., SGA). 7: end while 8: return φ, ψ |
| Open Source Code | Yes | The code is available at https://github.com/zcrabbit/vbpi-torch. |
| Open Datasets | Yes | We test the proposed variational Bayesian phylogenetic inference (VBPI) algorithms on 8 real data sets that are commonly used to benchmark phylogenetic MCMC methods (Lakner et al., 2008; Höhna and Drummond, 2012; Larget, 2013; Whidden and Matsen IV, 2015). ... The sequences were obtained from the NIAID Influenza Research Database (IRD) (Zhang et al., 2017) through the web site at https://www.fludb.org/, downloading all complete HA sequences that passed quality control, which were then subset to H7 sequences, and further downsampled using the Average Distance to the Closest Leaf (ADCL) criterion (Matsen et al., 2013). The sequence subsets are available in the vbpi-torch GitHub repository. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits. It discusses how datasets are used for inference and how MCMC samples are processed (burn-in), but not how data is split for evaluating model generalization in a typical machine learning sense. |
| Hardware Specification | Yes | All experiments were done on a 2019 Mac Pro (3.2 GHz 16-Core Intel Xeon W). |
| Software Dependencies | Yes | All models were implemented in PyTorch (Paszke et al., 2019) with the Adam optimizer (Kingma and Ba, 2015). ... We ran MrBayes with 4 chains and 10 runs for two million iterations... We ran MrBayes with the BEAGLE backend as well. ... We used MrBayes with 4 chains and 5 (i.e. 2 × 10/4) times the number of iterations when compared to VBPI... The phylogenetic bootstrap performs bootstrapping on the sites of a multiple sequence alignment... All bootstrapping approaches were run with 8,000 replicates. The classical MCMC Bayesian analyses were done in MrBayes (Ronquist et al., 2012)... Bootstrapping with neighbor joining and maximum parsimony were performed using PAUP* (Swofford, 2001). Bootstrapping with maximum likelihood was run in UFBoot (Minh et al., 2013). ... We also compared to MCMC using BEAST 1.10.4 (Suchard et al., 2018). |
| Experiment Setup | Yes | Following Rezende and Mohamed (2015), we use a simple annealed version of the lower bound which was found to provide better results. The modified bound is: L_K^{β_t}(φ, ψ) = E_{Q_{φ,ψ}(τ^{1:K}, q^{1:K})} log( (1/K) Σ_{i=1}^{K} [p(Y|τ^i, q^i)]^{β_t} p(τ^i, q^i) / (Q_φ(τ^i) Q_ψ(q^i|τ^i)) ), where β_t = min(1, 0.001 + t/100,000) is an inverse temperature schedule that goes from 0.001 to 1 after 99,900 iterations. We trained the variational approximations with VIMCO and RWS using 10- and 20-sample objectives and used Adam with the default learning rate 0.001 for both methods. We used an exponential learning rate schedule that decays the learning rate by a factor of 0.75 every 20,000 iterations. To encourage exploration in the tree topology space at the beginning, we initialize the variational parameters φ for SBNs to zero (which leads to uniformly distributed CPDs). The same settings are applied to all data sets. |
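The annealing schedule and the K-sample tempered bound quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration of the formulas only, not code from the vbpi-torch repository; the function names and the log-space evaluation strategy are choices made for this example.

```python
import math

def inverse_temperature(t, beta0=0.001, warmup=100_000):
    """Annealing schedule from the paper: beta_t = min(1, 0.001 + t/100000)."""
    return min(1.0, beta0 + t / warmup)

def annealed_k_sample_bound(log_lik, log_prior, log_q, beta):
    """Monte Carlo estimate of the annealed K-sample lower bound:
    log( (1/K) * sum_i [p(Y|tau_i,q_i)]^beta * p(tau_i,q_i) / Q(tau_i,q_i) ),
    evaluated stably in log space via log-sum-exp."""
    k = len(log_lik)
    log_w = [beta * ll + lp - lq
             for ll, lp, lq in zip(log_lik, log_prior, log_q)]
    m = max(log_w)
    return m + math.log(sum(math.exp(w - m) for w in log_w)) - math.log(k)
```

At t = 0 the bound is heavily tempered (β ≈ 0.001, nearly ignoring the likelihood), which flattens the objective early in training; after 99,900 iterations β reaches 1 and the objective reduces to the untempered K-sample lower bound.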
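Steps 3–6 of Algorithm 1 in the Pseudocode row (sample noise, reparameterize, estimate a gradient, take a stochastic gradient ascent step) can be illustrated with a deliberately simplified toy: a one-parameter Gaussian variational family fit by reparameterized SGA. Everything below (the quadratic log-target, the parameter `mu`, the step sizes) is invented for illustration and is not the paper's model.

```python
import random

random.seed(0)

def train_reparam(steps=2000, lr=0.05, K=10):
    """Toy analogue of Algorithm 1's loop: fit mu of N(mu, 1) to a
    log-target -(z - 3)^2 / 2 by reparameterized stochastic gradient ascent."""
    mu = 0.0  # variational parameter (loose analogue of psi)
    for _ in range(steps):
        grad = 0.0
        for _ in range(K):
            eps = random.gauss(0.0, 1.0)   # step 4: eps ~ N(0, I)
            z = mu + eps                   # reparameterized sample
            grad += -(z - 3.0)             # d/dmu of -(z-3)^2/2, since dz/dmu = 1
        mu += lr * grad / K                # step 6: SGA update
    return mu

mu = train_reparam()  # converges near the target mean of 3
```

In VBPI the analogous update acts jointly on the tree-topology parameters φ (via score-based estimators such as VIMCO or RWS, since topologies are discrete) and the branch-length parameters ψ (via the reparameterization trick), using the gradient estimators of Section 4.4.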