Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy

Authors: Ya-Wei Eileen Lin, Ronald Coifman, Gal Mishne, Ronen Talmon

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTAL RESULTS We apply the proposed method to a synthetic example and to word-document and single-cell RNA-sequencing (scRNA-seq) data. ... Table 1 presents the document classification accuracy. ... E.3 ABLATION STUDIES We note that our method consists of several non-trivial steps. We provide theoretical justification for our method in Sec. 4. Specifically, in Thm. 2, we show that the TWD TW(·, ·, T) based on the hidden feature tree T can be approximated by our TWD using the decoded tree B, where the proposed TWD and B are obtained solely from the data matrix, without access to the ground truth T. In addition, in Thm. 1, we show that the tree metric of the decoded tree B is bilipschitz equivalent to the tree metric of the hidden tree T. These results stem directly from the particular steps of the proposed method and might not hold otherwise. In the following, we perform a comprehensive empirical ablation study, demonstrating the importance of each component in our method.
Researcher Affiliation | Academia | Ya-Wei Eileen Lin1 Ronald R. Coifman2 Gal Mishne3 Ronen Talmon1 1Viterbi Faculty of Electrical and Computer Engineering, Technion 2Department of Mathematics, Yale University 3Halicioğlu Data Science Institute, University of California San Diego
Pseudocode | Yes | Algorithm 1 HD Binary Tree Decoding. Input: Embedding {{z_1^k}, ..., {z_m^k}}_{k=0}^{K_c}. Output: B = (Ṽ, Ẽ, Ã) with m leaf nodes. function HD_BT({{z_1^k}, ..., {z_m^k}}_{k=0}^{K_c}): B ← leaves({j} : j ∈ [m]); for j, j' ∈ [m] do ... Algorithm 2 TWD with a latent feature hierarchy. Input: Data matrix X ∈ R^{n×m}, feature diffusion operator P, and maximal scale K_c. Output: TWD W ∈ R^{n×n}. function TWD_Latent_Hie_Feature(X, K_c)
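Once a tree over the features is decoded, the TWD between two samples reduces to a weighted sum, over tree edges, of the absolute difference in mass that the two samples place under each edge's subtree. A minimal NumPy sketch of that closed form (the parent-array tree encoding and the function name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def tree_wasserstein(parent, weight, leaf_of, mu, nu):
    """Tree-Wasserstein distance between two histograms on the leaves.

    parent[v]  : parent node of v (root is node 0, parent[0] = -1;
                 nodes are topologically ordered so parent[v] < v)
    weight[v]  : weight of the edge (v, parent[v])
    leaf_of[j] : tree node holding leaf (feature) j
    mu, nu     : histograms over the m leaves (each sums to 1)
    """
    diff = np.zeros(len(parent))
    # Deposit the per-leaf mass difference at each leaf node.
    for j, v in enumerate(leaf_of):
        diff[v] += mu[j] - nu[j]
    # Sweep from the deepest nodes toward the root, accumulating
    # |mass difference below each edge| times the edge weight.
    total = 0.0
    for v in range(len(parent) - 1, 0, -1):
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total
```

For a three-node tree (root with two unit-weight leaf children), moving all mass from one leaf to the other traverses both edges, so the distance is 2.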
Open Source Code | Yes | Our code is available at https://github.com/ya-wei-eileen-lin/TWDwithLatentFeatureHierarchy.
Open Datasets | Yes | We test four word-document benchmarks in Kusner et al. (2015): (i) the BBCSPORT ... (ii) the TWITTER ... (iii) the CLASSIC ... and (iv) the AMAZON ... The Word2Vec embedding (Mikolov et al., 2013) is used as the word embedding vectors that are trained on the Google News1 dataset. ... We test two scRNA-seq datasets in Dumitrascu et al. (2021), where the datasets are available at the link2. ... The first dataset, the Zeisel dataset, focuses on the mouse cortex and hippocampus (Zeisel et al., 2015). ... The second dataset, the CBMC dataset, is derived from a cord blood mononuclear cell study (Stoeckius et al., 2017). ... 1https://code.google.com/archive/p/word2vec/ 2https://github.com/solevillar/scGeneFit-python/tree/62f88ef0765b3883f592031ca593ec79679a52b4/scGeneFit/data_files
Dataset Splits | Yes | In our experiments, the datasets are divided into a 70% training set and a 30% testing set. The split of the word-document datasets in Sec. 5.2 follows previous work (Kusner et al., 2015; Huang et al., 2016). This split was applied both to the learning of the underlying feature tree and to the kNN model; therefore, there is no leakage of training data into the test data. The feature tree constructed from the training data can be reused to process new samples. For each evaluation, we employ a kNN classifier with varying k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. The evaluation is repeated five times, and we report the best result, averaged over these five runs.
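The kNN evaluation over a precomputed distance matrix can be sketched as follows; `knn_accuracy` and its argument layout are hypothetical, assuming the TWD between each test and training sample has already been computed:

```python
import numpy as np

def knn_accuracy(D, y_train, y_test, ks=(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)):
    """Accuracy of a kNN classifier on a precomputed distance matrix.

    D       : (n_test, n_train) distances, e.g. the proposed TWD
    y_train : integer labels of the training samples
    y_test  : integer labels of the test samples
    Returns a {k: accuracy} dict over the requested k values.
    """
    y_train = np.asarray(y_train)
    order = np.argsort(D, axis=1)          # nearest training samples first
    acc = {}
    for k in ks:
        neighbors = y_train[order[:, :k]]  # labels of the k nearest neighbors
        # Majority vote per test sample.
        preds = [np.bincount(row).argmax() for row in neighbors]
        acc[k] = float(np.mean(preds == np.asarray(y_test)))
    return acc
```

Selecting the best accuracy over k, averaged over repeated runs, then matches the protocol quoted above.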
Hardware Specification | Yes | The experiments are performed on an NVIDIA DGX A100.
Software Dependencies | No | As these parameters are task-specific and do not have a one-size-fits-all solution, we use Optuna (Akiba et al., 2019) to efficiently explore the parameter space and identify optimal settings, where (ϵ, K_c) are set to (10, 12), (5, 7), (1, 15), (10, 14), (2, 7), and (5, 9) for BBCSPORT, TWITTER, CLASSIC, AMAZON, Zeisel, and CBMC, respectively.
Experiment Setup | Yes | In our experiments, the datasets are divided into a 70% training set and a 30% testing set. ... For each evaluation, we employ a kNN classifier with varying k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. The evaluation is repeated five times, and we report the best result, averaged over these five runs. ... When constructing the Gaussian kernel in Sec. 3, the initial distance is derived from the cosine similarity calculated in the ambient space, and the kernel scale is set to {0.1, 1, 2, 5, 10} · σ, where σ is the median of the pairwise distances. ... The maximal scale K_c determines the range of scales over which the hyperbolic manifold operates. We follow (Lin et al., 2023) and suggest setting K_c ∈ {0, 1, . . . , 19}. ... we use Optuna (Akiba et al., 2019) to efficiently explore the parameter space and identify optimal settings, where (ϵ, K_c) are set to (10, 12), (5, 7), (1, 15), (10, 14), (2, 7), and (5, 9) for BBCSPORT, TWITTER, CLASSIC, AMAZON, Zeisel, and CBMC, respectively.
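The kernel construction quoted above (cosine-based initial distances with a median-scaled Gaussian) can be sketched as below; the function name and the exact normalization details are assumptions for illustration, not the paper's code:

```python
import numpy as np

def gaussian_kernel_from_cosine(X, scale_factor=1.0):
    """Gaussian affinity between feature columns of a data matrix.

    X            : (n, m) data matrix; columns are features
    scale_factor : multiplier from {0.1, 1, 2, 5, 10} applied to the
                   median pairwise distance sigma (median heuristic)
    """
    # Cosine similarity between feature columns in the ambient space.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    cos_sim = Xn.T @ Xn
    dist = 1.0 - cos_sim                   # cosine distance, zero diagonal
    # Median heuristic: sigma = median of the off-diagonal distances.
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    eps = scale_factor * sigma
    return np.exp(-dist**2 / (2 * eps**2))
```

The resulting symmetric affinity matrix is the natural input for building a diffusion operator over features, with `scale_factor` and K_c then tuned per task as described.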