reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Megaman: Scalable Manifold Learning in Python

Authors: James McQueen, Marina Meilă, Jacob VanderPlas, Zhongyue Zhang

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In benchmarks, even on a single-core desktop computer, our code embeds millions of data points in minutes, and takes just 200 minutes to embed the main sample of galaxy spectra from the Sloan Digital Sky Survey consisting of 0.6 million samples in 3750 dimensions, a task which has not previously been possible. We display total embedding time (including time to compute the graph G, the Laplacian matrix and the embedding) for megaman versus scikit-learn, as the number of samples N varies or the data dimension D varies (Figure 1). All benchmark computations were performed on a single desktop computer running Linux with 24.68GB RAM and a Quad-Core 3.07GHz Intel Xeon CPU. We use a relatively weak machine to demonstrate that our package can be reasonably used without high performance hardware. The experiments show that megaman scales considerably better than scikit-learn.
Researcher Affiliation	Academia	James McQueen, EMAIL, Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA; Marina Meilă, EMAIL, Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA; Jacob VanderPlas, EMAIL, e-Science Institute, University of Washington, Seattle, WA 98195-4322, USA; Zhongyue Zhang, EMAIL, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-4322, USA
Pseudocode	No	The paper provides actual Python code snippets in Section 4 'Quick start' to demonstrate library usage, rather than pseudocode or algorithm blocks outlining the methodology.
Open Source Code	Yes	megaman is publicly available at: https://github.com/mmp2/megaman.
Open Datasets	Yes	The word2vec data used were from GoogleNews-vectors-negative300.bin.gz which can be downloaded from https://code.google.com/archive/p/word2vec/. The Sloan Digital Sky Survey data can be downloaded from www.sdss.org.
Dataset Splits	No	The paper mentions using datasets like 'Swiss Roll', 'word2vec', and 'Sloan Digital Sky Survey' for benchmarking, but it does not specify any training, validation, or test splits. It only refers to varying the number of samples (N) or dimensions (D).
Hardware Specification	Yes	All benchmark computations were performed on a single desktop computer running Linux with 24.68GB RAM and a Quad-Core 3.07GHz Intel Xeon CPU.
Software Dependencies	No	megaman's required dependencies are numpy, scipy, and scikit-learn, but for optimal performance FLANN, cython, pyamg and the C compiler gcc are also required. The paper lists software dependencies but does not provide specific version numbers for them.
Experiment Setup	Yes	from megaman.geometry import Geometry; from megaman.embedding import SpectralEmbedding ... radius = 1.1 ... geom = Geometry(adjacency_kwds={'radius': 3*radius}, ... affinity_kwds={'radius': radius}, ... SE = SpectralEmbedding(n_components=2, eigen_solver='amg', geom=geom)
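
The Experiment Setup excerpt uses two radius settings: an adjacency cutoff of 3 * radius and an affinity radius. The sketch below illustrates in plain Python what those two parameters could control. It is not megaman's implementation: the function name `radius_graph`, the `cutoff_factor` argument, and the Gaussian kernel form exp(-(d / radius)^2) are assumptions made for illustration, and the three sample points are made up.

```python
import math


def radius_graph(points, radius, cutoff_factor=3.0):
    """Build a sparse weighted neighbor graph (hypothetical helper, not megaman API).

    Pairs farther apart than cutoff_factor * radius get no edge.
    Remaining pairs get an assumed Gaussian weight exp(-(d / radius)**2).
    """
    cutoff = cutoff_factor * radius
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= cutoff:
                # Assumed kernel form; the library's own kernel may differ.
                edges[(i, j)] = math.exp(-(d / radius) ** 2)
    return edges


# Three illustrative 2-D points; radius matches the value quoted in the table.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
graph = radius_graph(points, radius=1.1)
for (i, j), weight in sorted(graph.items()):
    print(f"edge {i}-{j}: weight {weight:.4f}")
```

With radius = 1.1 the cutoff is 3.3, so only the first two points are connected. The brute-force double loop is quadratic in the number of points and is meant only to make the parameters concrete; it does not reflect the approximate nearest-neighbor search (FLANN) that the paper lists as an optional dependency.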