Gradient Flow Provably Learns Robust Classifiers for Orthonormal GMMs

Authors: Hancheng Min, Rene Vidal

ICML 2025

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. From the paper: "This paper shows that for certain data distributions one can learn a provably robust classifier using standard learning methods and without adding a defense mechanism. More specifically, this paper addresses the problem of finding a robust classifier for a binary classification problem in which the data comes from an isotropic mixture of Gaussians with orthonormal cluster centers..." and "Our second set of results is to develop a full convergence analysis for gradient flow on a two-layer pReLU network and show that: Theorem (Theorem 1 & Corollary 1, informal). When the intra-cluster variance α^2 is sufficiently small, gradient flow on pReLU networks (5) with p > 2 converges to a nearly optimal robust classifier." The paper also includes "Appendix B. Additional Experiments on Learning Robust Classifiers for Data from Orthonormal GMMs".
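The summary above does not define the pReLU activation; a common reading, assumed here, is the ReLU raised to the power p, with the two-layer architecture f(x) = vᵀ pReLU(Wx). The function names and hidden width below are illustrative, not from the paper:

```python
import numpy as np

def prelu(z, p=3.0):
    # Assumed activation: ReLU raised to the power p (the theorem requires p > 2).
    return np.maximum(z, 0.0) ** p

def two_layer_prelu(x, W, v, p=3.0):
    # Assumed architecture: f(x) = v^T pReLU(W x), with hidden weights W
    # and output weights v.
    return v @ prelu(W @ x, p)

rng = np.random.default_rng(0)
D, h = 1000, 50                          # input dimension from the paper; width h is illustrative
W = rng.normal(0.0, 1e-7, size=(h, D))   # small initialization scale, as in Appendix B.1
v = rng.normal(0.0, 1e-7, size=h)
x = rng.normal(size=D)
print(two_layer_prelu(x, W, v))
```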
Researcher Affiliation: Academia. 1Center for Innovation in Data Engineering and Science (IDEAS), 2Department of Electrical and Systems Engineering, 3Department of Radiology, University of Pennsylvania, Philadelphia, U.S.A. Correspondence to: Hancheng Min <EMAIL>.
Pseudocode: No. The paper describes mathematical derivations and theoretical proofs related to gradient flow and network architectures. It does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper does not provide any concrete access to source code for the methodology described. There are no links to repositories, nor explicit statements about code release.
Open Datasets: No. The paper uses a synthetic dataset generated from the 'Orthonormal Gaussian Mixture Model' described within the paper. It specifies 'Consider a balanced mixture of K Gaussians in RD' and later 'synthetic GMM dataset of size n = 5000'. It does not refer to a publicly available, pre-existing dataset with a link, DOI, or formal citation.
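The synthetic data described above can be sketched directly. Two details are assumptions not fixed by this summary: the orthonormal cluster centers are taken to be standard basis vectors, and the first K1 clusters are assigned the positive label:

```python
import numpy as np

def sample_orthonormal_gmm(n, D, K1, K2, alpha, seed=0):
    # Sketch of the paper's synthetic data: K = K1 + K2 isotropic Gaussian
    # clusters with orthonormal means (assumed: standard basis vectors),
    # intra-cluster std alpha, balanced cluster assignment in expectation.
    rng = np.random.default_rng(seed)
    K = K1 + K2
    mus = np.eye(D)[:K]                     # orthonormal cluster centers
    ks = rng.integers(0, K, size=n)         # uniform cluster assignment
    X = mus[ks] + alpha * rng.normal(size=(n, D))
    y = np.where(ks < K1, 1.0, -1.0)        # assumed label split
    return X, y

# Parameters reported in Appendix B.1
X, y = sample_orthonormal_gmm(n=5000, D=1000, K1=5, K2=5, alpha=0.1)
print(X.shape, y.shape)
```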
Dataset Splits: No. The paper describes generating a balanced dataset D̂ = {(x_i, y_i)}_{i=1}^{KN} of a certain size ('n = 5000' or '20000' in experiments). While it defines the characteristics of this synthetic dataset, it does not provide explicit training/test/validation splits for experiment reproduction. It implies training on the entire generated dataset: 'trained for a sufficient amount of epochs until they achieve perfect training accuracy on a synthesis orthonormal Gaussian mixture dataset'.
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments described in Appendix B.
Software Dependencies: No. The paper mentions using 'gradient descent (SGD)' but does not specify any software frameworks (e.g., PyTorch, TensorFlow) or their version numbers, nor any other ancillary software dependencies with specific versions.
Experiment Setup: Yes. Appendix B.1: 'We run GD with step size 0.2 on a synthetic GMM dataset of size n = 5000 with D = 1000, K1 = 5, K2 = 5, α = 0.1, and keep track of the following:' and 'The initialization scale is ϵ = 10^-7'. Appendix B.2: 'small initialization (all weight entries are randomly initialized as N(0, 1/10^4))', 'large initialization scale, where all weight entries are randomly initialized as N(0, 0.25)'. It also states 'All networks here are trained for a sufficient amount of epochs until they achieve perfect training accuracy'.
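Putting the reported hyperparameters together, the training loop might look like the following sketch. The logistic loss, the two-layer pReLU architecture, the epoch count, and the placeholder data are all assumptions; dimensions are scaled down from n = 5000, D = 1000 so the sketch runs quickly:

```python
import numpy as np

# Hypothetical reconstruction of the Appendix B.1 setup: full-batch GD with
# step size 0.2 and small initialization scale eps = 1e-7.
rng = np.random.default_rng(0)
n, D, h, p = 200, 50, 20, 3.0            # scaled-down sizes; h and p illustrative
lr, eps = 0.2, 1e-7                      # step size and init scale from the paper

X = rng.normal(size=(n, D))              # placeholder data (not the paper's GMM sampler)
y = rng.choice([-1.0, 1.0], size=n)
W = eps * rng.normal(size=(h, D))
v = eps * rng.normal(size=h)

for epoch in range(100):
    Z = X @ W.T                          # (n, h) pre-activations
    A = np.maximum(Z, 0.0) ** p          # assumed pReLU activation
    f = A @ v                            # network outputs, shape (n,)
    g = -y / (1.0 + np.exp(y * f))       # d(logistic loss)/d f
    dv = A.T @ g / n                     # gradient w.r.t. output weights
    dZ = np.outer(g, v) * p * np.maximum(Z, 0.0) ** (p - 1)
    W -= lr * (dZ.T @ X) / n             # gradient step on hidden weights
    v -= lr * dv
```

With p = 3 and eps = 1e-7 the outputs start near zero, matching the small-initialization regime the paper's analysis focuses on.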