Supervised Dimensionality Reduction and Visualization using Centroid-Encoder

Authors: Tomojit Ghosh, Michael Kirby

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We present a detailed comparative analysis of the method using a wide variety of data sets and techniques, both supervised and unsupervised, including NCA, non-linear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. An analysis of variance using PCA demonstrates that a non-linear preprocessing by the CE transformation of the data captures more variance than PCA by dimension.
Researcher Affiliation Academia Tomojit Ghosh EMAIL Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA; Michael Kirby EMAIL Department of Mathematics, Colorado State University, Fort Collins, CO 80523, USA
Pseudocode Yes Algorithm 1: Supervised Non-linear Centroid-Encoder (without pre-training). Input: labeled data D = {x_i}_{i=1}^N with M classes; I_j the index set of class C_j. User-defined parameters: error tolerance τ, learning rate µ, bottleneck dimension m. Output: bottleneck output y_i = g(x_i); network parameters θ. Result: non-linear embedding of the data in m dimensions. Initialization: class centroids c_j = (1/|C_j|) Σ_{i∈I_j} x_i, j = 1, …, M; iteration t ← 0; partition D into a training set (Tr) and a validation set (V). (1) while |L_C^{t+1}(V) − L_C^t(V)| > τ do (2) compute the loss L_C(V) and L_C(Tr) using Equation 1; (3) compute the backpropagation error ∇_θ L_C using Equation 3 on the training set Tr; (4) update the model parameters θ by θ^{t+1} = θ^t − µ ∇_θ L_C.
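The algorithm's core idea is that the network is trained to map every sample to the centroid of its own class, and the bottleneck layer then serves as the low-dimensional embedding. The following is a minimal NumPy sketch of one gradient step, not the authors' PyTorch implementation; the two-layer architecture and the names `W1`/`W2` are illustrative assumptions.

```python
import numpy as np

def centroid_encoder_step(X, y, params, lr=0.01):
    """One gradient-descent step of a toy centroid-encoder.

    Encoder W1 (d -> m) with tanh gives the bottleneck embedding;
    decoder W2 (m -> d) maps it back toward the class centroid.
    Targets are the per-class centroids c_j (illustrative sketch,
    not the paper's full multi-layer network)."""
    W1, W2 = params
    classes = np.unique(y)
    # class centroids c_j = mean of samples in class j
    C = np.stack([X[y == c].mean(axis=0) for c in classes])
    T = C[np.searchsorted(classes, y)]   # per-sample centroid target
    H = np.tanh(X @ W1)                  # bottleneck output g(x)
    Out = H @ W2                         # reconstruction toward centroid
    R = Out - T                          # residual
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    N = X.shape[0]
    gW2 = H.T @ R / N                    # backprop through decoder
    gH = (R @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
    gW1 = X.T @ gH / N
    # parameter update: theta <- theta - lr * grad
    return (W1 - lr * gW1, W2 - lr * gW2), loss, H
```

Iterating this step until the validation loss change falls below the tolerance τ reproduces the stopping rule in the pseudocode above.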
Open Source Code Yes The implementation is available at: https://github.com/Tomojit1/Centroid-encoder/ tree/master/GPU.
Open Datasets Yes The data sets we employed in our experiments are: (1) MNIST digits, (2) USPS data, (3) Letter Recognition data, (4) Landsat Satellite data, (5) Phoneme data, (6) Iris data, (7) Sonar data, and (8) Supersymmetry (SUSY) particle data. The details of the data sets may be found in the supplementary material.
Dataset Splits Yes For USPS, we followed the strategy in (Min et al., 2010), where we randomly split the entire data set into a training set of 8000 samples and a test set consisting of 3000 samples. We repeat the experiments K = 10 times and report the average error rate with standard deviation. To train CE with an optimal number of epochs, we used 10% of training samples as a validation set in all of our visualization experiments.
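The split protocol quoted above (8000 training / 3000 test samples, with 10% of the training set held out for validation) can be sketched as a simple index partition; the function name and seed handling here are assumptions, not the authors' code.

```python
import numpy as np

def usps_style_split(n_total, n_train=8000, n_test=3000, val_frac=0.10, seed=0):
    """Random index split following the protocol described above:
    n_train samples for training, n_test for testing, and val_frac of
    the training samples held out as a validation set (for choosing
    the number of training epochs). Repeating with seed = 0..K-1
    gives the K = 10 random partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    train, test = idx[:n_train], idx[n_train:n_train + n_test]
    n_val = int(round(val_frac * n_train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test
```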
Hardware Specification No We have implemented centroid-encoder in PyTorch to run on GPUs.
Software Dependencies No We have implemented centroid-encoder in PyTorch to run on GPUs... and we used the scikit-learn (Pedregosa et al., 2011) package to run supervised UMAP.
Experiment Setup Yes (1) Select a small data set from the training set and run 10-fold cross-validation to determine the network architecture and hyper-parameters; (2) using this architecture, train K models on different data partitions as described below; (3) using the sequestered test set, compute average k-NN (k = 5) classification errors on the 2D representation with standard deviations. Table 1: Network topology used for CE on various data sets. The number d is the input dimension of the network and is data set dependent. Table 2: Hyper-parameters for different models. The model parameters are updated using the Adam optimizer (Kingma and Ba, 2014).
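The evaluation metric in step (3) above, a k-NN (k = 5) classification error on the 2-D embedding, can be sketched as follows; this is a plain NumPy illustration of the metric, not the evaluation code used in the paper.

```python
import numpy as np

def knn_error(train_2d, train_y, test_2d, test_y, k=5):
    """k-NN classification error on a 2-D embedding (majority vote
    over the k nearest training points, k = 5 as in the protocol)."""
    # pairwise squared distances: test points x train points
    d2 = ((test_2d[:, None, :] - train_2d[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]        # k nearest training indices
    votes = train_y[nn]                        # their labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float(np.mean(pred != test_y))
```

Averaging this error over the K = 10 random partitions, with its standard deviation, yields the numbers reported in the paper's comparison tables.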