reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sample Complexity of Branch-length Estimation by Maximum Likelihood

Authors: David Clancy Jr., Hanbaek Lyu, Sebastien Roch

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this work, we provide the first theoretical guarantee as to why this might be the case. We show that deep inside the Kesten-Stigum reconstruction regime, provided with polynomially many m samples (assuming the tree is balanced), there exists a universal parameter regime (independent of the size of the tree) where the log-likelihood function is strongly concave and smooth with high probability. On this high-probability likelihood landscape event, we show that the standard coordinate maximization algorithm converges exponentially fast to the maximum likelihood estimator, which is within O(1/ m) from the true parameter, provided a sufficiently close initial point.
Researcher Affiliation	Academia	1Department of Mathematics, University of Wisconsin-Madison, WI, USA. Correspondence to: David Clancy, Jr. <EMAIL>, Hanbaek Lyu <EMAIL>, Sebastien Roch <EMAIL>.
Pseudocode	Yes	See Alg. 1 in Appendix A for a detailed implementation of the algorithm.
Open Source Code	No	No explicit statement about code availability or a link to a code repository is provided in the paper.
Open Datasets	No	The paper describes generating synthetic data for a theoretical problem: 'm samples σ(1), . . . , σ(m) from the process as described above sampled from the true model Pθ'. It does not refer to any external or publicly available datasets.
Dataset Splits	No	The paper is theoretical and focuses on mathematical proofs and algorithm convergence. It describes a simulated process ('m independent samples σ(1), . . . , σ(m) from the process as described above') rather than using pre-existing datasets that would require explicit training/test/validation splits.
Hardware Specification	No	The paper presents theoretical analysis and mathematical proofs; it does not describe any empirical experiments or the hardware used to run them.
Software Dependencies	No	The paper is theoretical and does not mention any specific software dependencies with version numbers required to replicate its findings or algorithms.
Experiment Setup	No	The paper is theoretical, providing mathematical guarantees for maximum likelihood estimation and algorithm convergence. It does not describe an experimental setup, hyperparameters, or training configurations.