Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data
Authors: T. Tony Cai, Rong Ma
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth the interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications. Sections 4 and 5 present Application I (Visualizing Model-Based Clustered Data) and Application II (Visualizing Real-World Clustered Data). |
| Researcher Affiliation | Academia | T. Tony Cai EMAIL Department of Statistics and Data Science University of Pennsylvania Philadelphia, PA 19104, USA. Rong Ma EMAIL Department of Statistics Stanford University Stanford, CA 94305, USA. |
| Pseudocode | No | The paper describes the iterative algorithms using mathematical equations, such as equation (4) for the updating equation and equation (6) for the early exaggeration stage, but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | No | The text mentions using an existing R package: "The visualizations are obtained using the Rtsne function in the R package Rtsne". However, there is no explicit statement or link indicating that the authors have released their own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | Finally, we demonstrate our theory by applying t-SNE to the MNIST3 dataset, which contains images of hand-written digits. Specifically, we focus on n = 4N = 1600 images of hand-written digits 2, 4, 6 and 8, with each digit having N = 400 images. Each image contains 28 x 28 pixels and was treated as a 784-dimensional vector. Based on our theoretical analysis, we set the tuning parameters... 3. http://yann.lecun.com/exdb/mnist/ |
| Dataset Splits | No | Specifically, we focus on n = 4N = 1600 images of hand-written digits 2, 4, 6 and 8, with each digit having N = 400 images. This describes the composition of the dataset used, but does not provide specific training/test/validation splits or percentages required for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud resources) used for running the experiments or generating the visualizations. |
| Software Dependencies | No | The visualizations are obtained using the Rtsne function in the R package Rtsne... (Figure 1 caption). The paper mentions the 'R package Rtsne' but does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | The visualizations are obtained using the Rtsne function in the R package Rtsne, by selecting the exact t-SNE mode (theta=0, pca=F), dropping the momentum terms (momentum=0, final_momentum=0), and setting perplexity=30 (default), α = 12 (default), h = 200 (default) in (6), and K0 = 40. (Figure 1 caption) ...we set the tuning parameters α = n^(1-δ), h = h' = n^δ, K0 = (log n)^2 (39) with δ = 2/3. (Section 5) Different columns have identical initialization and tuning parameters, but distinct numbers of iterations for the early exaggeration stage, namely K0 = (log n)^2 = 54 (left), K0 = n^(2/3) = 137 (middle), and K0 = n^(3/4) = 253 (right). (Figure 5 caption) |
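The early-exaggeration result summarized in the Research Type row, that gradient descent in this stage behaves like power iterations on a matrix built from the affinity graph, can be illustrated with a small NumPy sketch. The matrix `I + n*P` below is a simplified stand-in for the paper's exaggerated update operator, not the exact operator from equation (6) (the paper's per-point bandwidth calibration and normalization are skipped); the point is only that repeated multiplication by an affinity-based matrix drives a random start toward a cluster-indicator direction, as in Laplacian spectral clustering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters in R^2 (10 points each).
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(5, 0.1, (10, 2))])

# Gaussian affinities with a fixed bandwidth (t-SNE would instead
# calibrate per-point bandwidths from the perplexity).
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / 2.0)
np.fill_diagonal(W, 0.0)
P = W / W.sum()              # joint affinity matrix, entries sum to 1

n = len(X)
M = np.eye(n) + n * P        # toy surrogate for the exaggerated update operator

# Power iterations from a random start, mimicking random initialization
# of the low-dimensional map (one coordinate shown).
y = rng.normal(size=n)
for _ in range(50):
    y = M @ y
    y -= y.mean()            # project out the trivial constant direction
    y /= np.linalg.norm(y)   # keep the iterate normalized

# The iterate aligns with a cluster-indicator direction: one sign per cluster.
labels = (y > 0).astype(int)
```

After the loop, `labels` is constant within each cluster and differs between them, which is the spectral-clustering behavior the paper attributes to the early exaggeration stage.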
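The tuning-parameter schedule quoted from Section 5 can be checked with a few lines of arithmetic; for n = 1600 and δ = 2/3 it reproduces the K0 values quoted from the Figure 5 caption. This is a sketch; rounding to the nearest integer is an assumption about how the paper obtains the reported counts.

```python
import math

# Schedule from equation (39): alpha = n^(1-delta), h = h' = n^delta,
# K0 = (log n)^2, with delta = 2/3 and n = 1600 (the MNIST subset size).
n, delta = 1600, 2 / 3

alpha = n ** (1 - delta)       # exaggeration factor, ~11.7 (used as alpha = 12)
h = n ** delta                 # step size, ~136.8
K0 = round(math.log(n) ** 2)   # early exaggeration iterations -> 54

# Figure 5 compares three early-exaggeration lengths for the same n:
K0_mid = round(n ** (2 / 3))   # -> 137
K0_long = round(n ** (3 / 4))  # -> 253
```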