Supervised Dimensionality Reduction and Visualization using Centroid-Encoder

Authors: Tomojit Ghosh, Michael Kirby

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We present a detailed comparative analysis of the method using a wide variety of data sets and techniques, both supervised and unsupervised, including NCA, non-linear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. An analysis of variance using PCA demonstrates that a non-linear preprocessing by the CE transformation of the data captures more variance than PCA by dimension.
Researcher Affiliation Academia Tomojit Ghosh EMAIL Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA; Michael Kirby EMAIL Department of Mathematics, Colorado State University, Fort Collins, CO 80523, USA
Pseudocode Yes Algorithm 1: Supervised Non-linear Centroid-Encoder (without pre-training). Input: labeled data D = {x_i}_{i=1}^N with M classes; I_j the index set of class C_j. User-defined parameters: error tolerance τ, learning rate µ, bottleneck dimension m. Output: bottleneck output y_i = g(x_i); network parameters θ. Result: non-linear embedding of the data in m dimensions. Initialization: class centroids c_j = (1/|C_j|) Σ_{i∈I_j} x_i, j = 1, …, M; iteration t ← 0; partition D into a training set (Tr) and a validation set (V). (1) while |L_C^{t+1}(V) − L_C^t(V)| > τ do (2) compute the loss L_C(V) and L_C(Tr) using Equation 1; (3) compute the backpropagation error ∇_θ L_C using Equation 3 on the training set Tr; (4) update the model parameters θ by θ^{t+1} = θ^t − µ ∇_θ L_C.
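The algorithm's core idea is that the network is trained to map every sample to the centroid of its own class, and the bottleneck layer then serves as the low-dimensional embedding. The following is a minimal NumPy sketch of one gradient step, not the authors' PyTorch implementation; the two-layer architecture and the names `W1`/`W2` are illustrative assumptions.

```python
import numpy as np

def centroid_encoder_step(X, y, params, lr=0.01):
    """One gradient-descent step of a toy centroid-encoder.

    Encoder W1 (d -> m) with tanh gives the bottleneck embedding;
    decoder W2 (m -> d) maps it back toward the class centroid.
    Targets are the per-class centroids c_j (illustrative sketch,
    not the paper's full multi-layer network)."""
    W1, W2 = params
    classes = np.unique(y)
    # class centroids c_j = mean of samples in class j
    C = np.stack([X[y == c].mean(axis=0) for c in classes])
    T = C[np.searchsorted(classes, y)]   # per-sample centroid target
    H = np.tanh(X @ W1)                  # bottleneck output g(x)
    Out = H @ W2                         # reconstruction toward centroid
    R = Out - T                          # residual
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    N = X.shape[0]
    gW2 = H.T @ R / N                    # backprop through decoder
    gH = (R @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
    gW1 = X.T @ gH / N
    # parameter update: theta <- theta - lr * grad
    return (W1 - lr * gW1, W2 - lr * gW2), loss, H
```

Iterating this step until the validation loss change falls below the tolerance τ reproduces the stopping rule in the pseudocode above.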
Open Source Code Yes The implementation is available at: https://github.com/Tomojit1/Centroid-encoder/ tree/master/GPU.
Open Datasets Yes The data sets we employed in our experiments are: (1) MNIST digits, (2) USPS data, (3) Letter Recognition data, (4) Landsat Satellite data, (5) Phoneme data, (6) Iris data, (7) Sonar data, and (8) Supersymmetry (SUSY) particle data. The details of the data sets may be found in the supplementary material.
Dataset Splits Yes For USPS, we followed the strategy in (Min et al., 2010), where we randomly split the entire data set into a training set of 8000 samples and a test set consisting of 3000 samples. We repeat the experiments K = 10 times and report the average error rate with standard deviation. To train CE with an optimal number of epochs, we used 10% of training samples as a validation set in all of our visualization experiments.
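The split protocol quoted above (8000 training / 3000 test samples, with 10% of the training set held out for validation) can be sketched as a simple index partition; the function name and seed handling here are assumptions, not the authors' code.

```python
import numpy as np

def usps_style_split(n_total, n_train=8000, n_test=3000, val_frac=0.10, seed=0):
    """Random index split following the protocol described above:
    n_train samples for training, n_test for testing, and val_frac of
    the training samples held out as a validation set (for choosing
    the number of training epochs). Repeating with seed = 0..K-1
    gives the K = 10 random partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    train, test = idx[:n_train], idx[n_train:n_train + n_test]
    n_val = int(round(val_frac * n_train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test
```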
Hardware Specification No We have implemented centroid-encoder in PyTorch to run on GPUs.
Software Dependencies No We have implemented centroid-encoder in PyTorch to run on GPUs... and we used the scikit-learn (Pedregosa et al., 2011) package to run supervised UMAP.
Experiment Setup Yes (1) Select a small data set from the training set and run 10-fold cross-validation to determine the network architecture and hyper-parameters; (2) using this architecture, train K models on different data partitions as described below; (3) using the sequestered test set, compute average k-NN (k = 5) classification errors on the 2D representation with standard deviations. Table 1: Network topology used for CE on various data sets. The number d is the input dimension of the network and is data set dependent. Table 2: Hyper-parameters for different models. The model parameters are updated using the Adam optimizer (Kingma and Ba, 2014).
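The evaluation metric in step (3) above, a k-NN (k = 5) classification error on the 2-D embedding, can be sketched as follows; this is a plain NumPy illustration of the metric, not the evaluation code used in the paper.

```python
import numpy as np

def knn_error(train_2d, train_y, test_2d, test_y, k=5):
    """k-NN classification error on a 2-D embedding (majority vote
    over the k nearest training points, k = 5 as in the protocol)."""
    # pairwise squared distances: test points x train points
    d2 = ((test_2d[:, None, :] - train_2d[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]        # k nearest training indices
    votes = train_y[nn]                        # their labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float(np.mean(pred != test_y))
```

Averaging this error over the K = 10 random partitions, with its standard deviation, yields the numbers reported in the paper's comparison tables.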