Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy

Authors: Ya-Wei Eileen Lin, Ronald Coifman, Gal Mishne, Ronen Talmon

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTAL RESULTS We apply the proposed method to a synthetic example and to word-document and single-cell RNA-sequencing (scRNA-seq) data. ... Table 1 presents the document classification accuracy. ... E.3 ABLATION STUDIES We note that our method consists of several non-trivial steps. We provide theoretical justification for our method in Sec. 4. Specifically, in Thm. 2, we show that the TWD TW(·, ·, T) based on the hidden feature tree T can be approximated by our TWD using the decoded tree B, where the proposed TWD and B are obtained solely from the data matrix, without access to the ground truth T. In addition, in Thm. 1, we show that the tree metric of the decoded tree B is bilipschitz equivalent to the tree metric of the hidden tree T. These results stem directly from the particular steps of the proposed method and might not hold otherwise. In the following, we perform a comprehensive empirical ablation study, demonstrating the importance of each component in our method.
Researcher Affiliation | Academia | Ya-Wei Eileen Lin1 Ronald R. Coifman2 Gal Mishne3 Ronen Talmon1 1Viterbi Faculty of Electrical and Computer Engineering, Technion 2Department of Mathematics, Yale University 3Halicioğlu Data Science Institute, University of California San Diego
Pseudocode | Yes | Algorithm 1 HD Binary Tree Decoding. Input: Embedding {{z_1^k}, ..., {z_m^k}}_{k=0}^{K_c}. Output: B = (Ṽ, Ẽ, Ã) with m leaf nodes. function HD_BT({{z_1^k}, ..., {z_m^k}}_{k=0}^{K_c}): B ← leaves({j} : j ∈ [m]); for j, j' ∈ [m] do ... Algorithm 2 TWD with a latent feature hierarchy. Input: Data matrix X ∈ R^{n×m}, feature diffusion operator P, and maximal scale K_c. Output: TWD W ∈ R^{n×n}. function TWD_Latent_Hie_Feature(X, K_c)
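Once a tree over the features is decoded, the TWD between two samples reduces to a weighted sum, over tree edges, of the absolute difference in mass that the two samples place under each edge's subtree. A minimal NumPy sketch of that closed form (the parent-array tree encoding and the function name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def tree_wasserstein(parent, weight, leaf_of, mu, nu):
    """Tree-Wasserstein distance between two histograms on the leaves.

    parent[v]  : parent node of v (root is node 0, parent[0] = -1;
                 nodes are topologically ordered so parent[v] < v)
    weight[v]  : weight of the edge (v, parent[v])
    leaf_of[j] : tree node holding leaf (feature) j
    mu, nu     : histograms over the m leaves (each sums to 1)
    """
    diff = np.zeros(len(parent))
    # Deposit the per-leaf mass difference at each leaf node.
    for j, v in enumerate(leaf_of):
        diff[v] += mu[j] - nu[j]
    # Sweep from the deepest nodes toward the root, accumulating
    # |mass difference below each edge| times the edge weight.
    total = 0.0
    for v in range(len(parent) - 1, 0, -1):
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total
```

For a three-node tree (root with two unit-weight leaf children), moving all mass from one leaf to the other traverses both edges, so the distance is 2.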
Open Source Code | Yes | Our code is available at https://github.com/ya-wei-eileen-lin/TWDwithLatentFeatureHierarchy.
Open Datasets | Yes | We test four word-document benchmarks in Kusner et al. (2015): (i) the BBCSPORT ... (ii) the TWITTER ... (iii) the CLASSIC ... and (iv) the AMAZON ... The Word2Vec embedding (Mikolov et al., 2013) is used as the word embedding vectors that are trained on the Google News1 dataset. ... We test two scRNA-seq datasets in Dumitrascu et al. (2021), where the datasets are available at the link2. ... The first dataset, the Zeisel dataset, focuses on the mouse cortex and hippocampus (Zeisel et al., 2015). ... The second dataset, the CBMC dataset, is derived from a cord blood mononuclear cell study (Stoeckius et al., 2017). ... 1https://code.google.com/archive/p/word2vec/ 2https://github.com/solevillar/scGeneFit-python/tree/62f88ef0765b3883f592031ca593ec79679a52b4/scGeneFit/data_files
Dataset Splits | Yes | In our experiments, the datasets are divided into a 70% training set and a 30% testing set. The split of the word-document datasets in Sec. 5.2 follows previous work (Kusner et al., 2015; Huang et al., 2016). This split was applied both to the learning of the underlying feature tree and to the kNN model; therefore, there is no leakage of training data into the test data. The feature tree constructed from the training data can be reused to process new samples. For each evaluation, we employ a kNN classifier with varying k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. The evaluation is repeated five times, and we report the best result, averaged over these five runs.
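The kNN evaluation over a precomputed distance matrix can be sketched as follows; `knn_accuracy` and its argument layout are hypothetical, assuming the TWD between each test and training sample has already been computed:

```python
import numpy as np

def knn_accuracy(D, y_train, y_test, ks=(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)):
    """Accuracy of a kNN classifier on a precomputed distance matrix.

    D       : (n_test, n_train) distances, e.g. the proposed TWD
    y_train : integer labels of the training samples
    y_test  : integer labels of the test samples
    Returns a {k: accuracy} dict over the requested k values.
    """
    y_train = np.asarray(y_train)
    order = np.argsort(D, axis=1)          # nearest training samples first
    acc = {}
    for k in ks:
        neighbors = y_train[order[:, :k]]  # labels of the k nearest neighbors
        # Majority vote per test sample.
        preds = [np.bincount(row).argmax() for row in neighbors]
        acc[k] = float(np.mean(preds == np.asarray(y_test)))
    return acc
```

Selecting the best accuracy over k, averaged over repeated runs, then matches the protocol quoted above.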
Hardware Specification | Yes | The experiments are performed on an NVIDIA DGX A100.
Software Dependencies | No | As these parameters are task-specific and do not have a one-size-fits-all solution, we use Optuna (Akiba et al., 2019) to efficiently explore the parameter space and identify optimal settings, where (ϵ, K_c) are set to (10, 12), (5, 7), (1, 15), (10, 14), (2, 7), and (5, 9) for BBCSPORT, TWITTER, CLASSIC, AMAZON, Zeisel, and CBMC, respectively.
Experiment Setup | Yes | In our experiments, the datasets are divided into a 70% training set and a 30% testing set. ... For each evaluation, we employ a kNN classifier with varying k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. The evaluation is repeated five times, and we report the best result, averaged over these five runs. ... When constructing the Gaussian kernel in Sec. 3, the initial distance is derived from the cosine similarity calculated in the ambient space, and the kernel scale is set to {0.1, 1, 2, 5, 10} · σ, where σ is the median of the pairwise distances. ... The maximal scale K_c determines the range of scales over which the hyperbolic manifold operates. We follow (Lin et al., 2023) and suggest setting K_c ∈ {0, 1, . . . , 19}. ... we use Optuna (Akiba et al., 2019) to efficiently explore the parameter space and identify optimal settings, where (ϵ, K_c) are set to (10, 12), (5, 7), (1, 15), (10, 14), (2, 7), and (5, 9) for BBCSPORT, TWITTER, CLASSIC, AMAZON, Zeisel, and CBMC, respectively.
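The kernel construction quoted above (cosine-based initial distances with a median-scaled Gaussian) can be sketched as below; the function name and the exact normalization details are assumptions for illustration, not the paper's code:

```python
import numpy as np

def gaussian_kernel_from_cosine(X, scale_factor=1.0):
    """Gaussian affinity between feature columns of a data matrix.

    X            : (n, m) data matrix; columns are features
    scale_factor : multiplier from {0.1, 1, 2, 5, 10} applied to the
                   median pairwise distance sigma (median heuristic)
    """
    # Cosine similarity between feature columns in the ambient space.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    cos_sim = Xn.T @ Xn
    dist = 1.0 - cos_sim                   # cosine distance, zero diagonal
    # Median heuristic: sigma = median of the off-diagonal distances.
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    eps = scale_factor * sigma
    return np.exp(-dist**2 / (2 * eps**2))
```

The resulting symmetric affinity matrix is the natural input for building a diffusion operator over features, with `scale_factor` and K_c then tuned per task as described.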