Megaman: Scalable Manifold Learning in Python

Authors: James McQueen, Marina Meilă, Jacob VanderPlas, Zhongyue Zhang

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In benchmarks, even on a single-core desktop computer, our code embeds millions of data points in minutes, and takes just 200 minutes to embed the main sample of galaxy spectra from the Sloan Digital Sky Survey consisting of 0.6 million samples in 3750 dimensions, a task which has not previously been possible. We display total embedding time (including time to compute the graph G, the Laplacian matrix and the embedding) for megaman versus scikit-learn, as the number of samples N varies or the data dimension D varies (Figure 1). All benchmark computations were performed on a single desktop computer running Linux with 24.68GB RAM and a Quad-Core 3.07GHz Intel Xeon CPU. We use a relatively weak machine to demonstrate that our package can be reasonably used without high performance hardware. The experiments show that megaman scales considerably better than scikit-learn.
Researcher Affiliation | Academia | James McQueen (EMAIL), Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA; Marina Meilă (EMAIL), Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA; Jacob VanderPlas (EMAIL), e-Science Institute, University of Washington, Seattle, WA 98195-4322, USA; Zhongyue Zhang (EMAIL), Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-4322, USA
Pseudocode | No | The paper provides actual Python code snippets in Section 4 'Quick start' to demonstrate library usage, rather than pseudocode or algorithm blocks outlining the methodology.
Open Source Code | Yes | megaman is publicly available at: https://github.com/mmp2/megaman.
Open Datasets | Yes | The word2vec data used were from GoogleNews-vectors-negative300.bin.gz, which can be downloaded from https://code.google.com/archive/p/word2vec/. The Sloan Digital Sky Survey data can be downloaded from www.sdss.org.
Dataset Splits | No | The paper mentions using datasets like 'Swiss Roll', 'word2vec', and 'Sloan Digital Sky Survey' for benchmarking, but it does not specify any training, validation, or test splits. It only refers to varying the number of samples (N) or dimensions (D).
Hardware Specification | Yes | All benchmark computations were performed on a single desktop computer running Linux with 24.68GB RAM and a Quad-Core 3.07GHz Intel Xeon CPU.
Software Dependencies | No | megaman's required dependencies are numpy, scipy, and scikit-learn, but for optimal performance FLANN, cython, pyamg and the C compiler gcc are also required. The paper lists software dependencies but does not provide specific version numbers for them.
Experiment Setup | Yes | from megaman.geometry import Geometry; from megaman.embedding import SpectralEmbedding ... radius = 1.1 ... geom = Geometry(adjacency_kwds={'radius': 3*radius}, ... affinity_kwds={'radius': radius}, ... SE = SpectralEmbedding(n_components=2, eigen_solver='amg', geom=geom)
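
The setup above uses two radii: neighbors are collected within 3 * radius, and affinities are computed with bandwidth radius. The following is a minimal standard-library sketch of that idea, not megaman's implementation. The helper names (`make_swiss_roll`, `radius_neighbor_affinities`), the brute-force neighbor search, and the Gaussian kernel form exp(-(d / radius)^2) are assumptions made for illustration; megaman itself relies on FLANN and sparse matrices for this step.

```python
import math
import random


def make_swiss_roll(n, seed=0):
    """Sample n points from a 3-D Swiss roll (hypothetical helper for this sketch)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        t = 1.5 * math.pi * (1 + 2 * rng.random())  # position along the spiral
        y = 21 * rng.random()                       # position along the roll's width
        points.append((t * math.cos(t), y, t * math.sin(t)))
    return points


def radius_neighbor_affinities(points, radius, cutoff_factor=3.0):
    """Return {(i, j): weight} for pairs i < j within cutoff_factor * radius.

    The weight is a Gaussian kernel exp(-(d / radius)**2). This kernel form
    is an assumption of the sketch, not a statement about megaman internals.
    """
    cutoff = cutoff_factor * radius
    weights = {}
    # Brute-force O(n^2) search; fine for a small illustration only.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= cutoff:
                weights[(i, j)] = math.exp(-(d / radius) ** 2)
    return weights


radius = 1.1  # value taken from the experiment setup row
points = make_swiss_roll(200)
affinities = radius_neighbor_affinities(points, radius)
print(len(affinities), "neighbor pairs within", 3 * radius)
```

Cutting the graph at a multiple of the kernel bandwidth keeps it sparse while discarding only pairs whose Gaussian weight would be negligible (below exp(-9) with these settings).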