Supervised and Semi-Supervised Diffusion Maps with Label-Driven Diffusion
Authors: Harel Mendelman, Ronen Talmon
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experimental results that showcase the performance of SDM and SSDM on a synthetic dataset and 12 real datasets. These experiments demonstrate the effectiveness of SDM and SSDM in learning low-dimensional embeddings and their impact on downstream tasks, such as regression and classification, compared to several classical and recent baselines. |
| Researcher Affiliation | Academia | Harel Mendelman, Ronen Talmon Viterbi Faculty of Electrical and Computer Engineering, Technion EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Generate the embedding of {x_i}_{i=1}^n; Algorithm 2: Supervised Diffusion Maps (SDM); Algorithm 3: Semi-Supervised Diffusion Maps (SSDM); Algorithm 4: Pseudocode for SVD Shrinkage-Based Denoising Tactic |
| Open Source Code | Yes | We provide a Python implementation of SDM and SSDM. The code is available at https://github.com/harel147/sdm. |
| Open Datasets | Yes | We evaluated 7 datasets from the UCI Machine Learning Repository and the Scikit-learn library (see Appendix E.1 for details). Table 3 (Datasets; Type: C = classification, R = regression): Iris (Fisher, 1988), C, n=150, d=4; Ionosphere (Sigillito & Baker, 1989), C, n=351, d=34; Yacht (Gerritsma & Versluis, 2013), R, n=308, d=6; Boston (Harrison Jr & Rubinfeld, 1978), R, n=506, d=13; Liver (mis, 1990), R, n=345, d=5; Arrhythmia (Guvenir & Quinlan, 1998), C, n=452, d=279; Musk (Chapman & Jain, 1994), C, n=476, d=166; Mice (Higuera & Cios, 2015), C, n=1080, d=77; Rice (mis, 2019), C, n=3810, d=7; Concrete (Yeh, 2007), R, n=1030, d=8; Silhouettes (Mowforth & Shepherd), C, n=845, d=18; Raisin (Çınar & Taşdemir, 2023), C, n=900, d=7 |
| Dataset Splits | Yes | Each dataset underwent 50 data splits (70% training / 30% testing) for learning the dimension-reduction embedding and training KNN models (5 neighbors for regression, 1 neighbor for classification). Consistent training (100 samples, 20% of the dataset) and testing sets (100 samples) were used across experiments to ensure that the results reflect only embedding quality. |
| Hardware Specification | Yes | The hardware utilized for testing the runtimes of SDM and SSDM, as depicted in Table 6 in Appendix E.5, is the ROG Strix G16 Asus laptop equipped with an Intel i9-14900HX processor. |
| Software Dependencies | No | In our study, we utilized the official Scikit-learn implementations for t-SNE, Isomap, LE, and LLE. For UMAP, supervised UMAP, and semi-supervised UMAP, we employed the official implementations. Diffusion Maps were implemented by us. In the original Diffusion Maps algorithm, Eigenvalue Decomposition (EVD) is applied to the kernel. In our approach, we utilize Singular Value Decomposition (SVD) instead, as previously suggested for Alternating Diffusion (AD) in Talmon & Wu (2019). We chose SVD because it produces results similar to EVD, but with the added advantages of faster computation and greater numerical stability in NumPy's implementation. Furthermore, the optimized SSDM is implemented using PyTorch, leveraging GPU acceleration to perform matrix multiplications in parallel. |
| Experiment Setup | Yes | We evaluated SDM and SSDM by training a KNN regressor on the top four spectral components for each t ∈ {0, 0.01, …, 1.0} using the training set and assessing performance on the test set using angular Mean Absolute Error (MAE). For each dataset and algorithm, we reduced the original data dimensionality to a range of 1 to 30 dimensions. We then trained KNN models (5 neighbors for regression, 1 neighbor for classification). To select the ε for the Gaussian kernels of the data kernel and label kernel for SDM and SSDM, we applied the following heuristic: we calculated 16 kernels using different ε values from 10⁻⁵ to 10¹⁰ and chose the ε value that resulted in the kernel with the largest number of eigenvalues in the range [0.0001, 0.9999]. We set the flow-resolution parameter, i.e., the number of sampled kernels along the interpolation defining {t_i}_{i=1}^{l} in the interval [0, 1], to 101 for all evaluated datasets. This means that we generated embeddings for each train and test sample along the interpolation for all kernels at t_i ∈ {0, 0.01, 0.02, …, 0.99, 1.0}, then selected the optimal t from this set based on the first data split, as described in Subsection 6.2.1. As for the distance metric, we used only the Euclidean distance for both the data kernel and the label kernel. |
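The ε-selection heuristic quoted in the Experiment Setup row (16 candidate bandwidths from 10⁻⁵ to 10¹⁰, keeping the one whose kernel has the most eigenvalues strictly inside [0.0001, 0.9999]) can be sketched as follows. This is a minimal reconstruction, not the authors' code: the row normalization, the use of singular values as a proxy for eigenvalue magnitudes, and all names (`select_epsilon`, `n_candidates`) are our assumptions.

```python
import numpy as np

def select_epsilon(X, n_candidates=16, lo=1e-5, hi=1e10):
    """Pick the Gaussian bandwidth whose normalized kernel has the most
    eigenvalues inside [1e-4, 1 - 1e-4] (hypothetical reconstruction)."""
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    best_eps, best_count = None, -1
    for eps in np.logspace(np.log10(lo), np.log10(hi), n_candidates):
        K = np.exp(-D2 / eps)
        # row-normalize into a diffusion (row-stochastic) kernel
        P = K / K.sum(axis=1, keepdims=True)
        # singular values stand in for eigenvalue magnitudes, mirroring
        # the paper's stated preference for SVD over EVD
        s = np.linalg.svd(P, compute_uv=False)
        count = int(np.sum((s > 1e-4) & (s < 1 - 1e-4)))
        if count > best_count:
            best_eps, best_count = eps, count
    return best_eps
```

Very small ε collapses the kernel toward the identity (all eigenvalues near 1) and very large ε toward a rank-one matrix (all but one near 0), so the count peaks at an intermediate bandwidth, which is what the heuristic exploits.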
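The Software Dependencies row notes that the authors replace EVD with SVD when computing the diffusion embedding. A minimal diffusion-maps sketch of that choice is below; the symmetric normalization, the dropped trivial component, and the function name `diffusion_maps_svd` are our assumptions, not the authors' exact implementation.

```python
import numpy as np

def diffusion_maps_svd(X, eps, n_components=4):
    """Diffusion-maps embedding computed via SVD instead of EVD (sketch)."""
    # Gaussian affinity kernel on pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    K = np.exp(-D2 / eps)
    # symmetric normalization keeps the matrix symmetric PSD, so its
    # singular vectors/values coincide with its eigen-structure and a
    # numerically stable SVD can replace EVD
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))
    U, s, _ = np.linalg.svd(A)
    # drop the trivial leading component; scale coordinates by spectrum
    return U[:, 1:n_components + 1] * s[1:n_components + 1]
```

On a symmetric positive semi-definite matrix the SVD returns the eigenvectors (up to sign) with singular values equal to the eigenvalues, which is why the swap preserves the embedding while gaining speed and stability in NumPy.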