Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
Authors: Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4.1, we conduct an analysis of the s-OTDD by comparing subsets from MNIST and CIFAR10, while being faster than existing competitors. Additionally, dataset distances are valuable in emerging areas such as synthetic data evaluation, 3D shape comparison, and federated learning, where comparing heterogeneous data distributions is fundamental. We randomly split MNIST and CIFAR10 to create subdataset pairs, each ranging in size from 5,000 to 10,000. We evaluate the proposed method and OTDD (Exact), with results presented in Figure 8. |
| Researcher Affiliation | Collaboration | 1Department of Statistics and Data Sciences, University of Texas at Austin, Texas, USA 2Qualcomm AI Research, Qualcomm Vietnam Company Limited. Correspondence to: Khai Nguyen <EMAIL>. |
| Pseudocode | Yes | We refer the reader to Algorithm 1 in Appendix B for a detailed computational algorithm. ... Algorithm 1 Computational Algorithm for s-OTDD |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it include a link to a code repository. The phrase "1All datasets and models were downloaded and evaluated at Movian AI or University of Texas at Austin" refers to using external resources, not releasing their own implementation. |
| Open Datasets | Yes | We conduct analysis for the s-OTDD in comparing subsets from MNIST and CIFAR10. We apply the proposed method to transfer learning, following the OTDD framework... NIST datasets (Deng, 2012) and a diverse set of text datasets (Zhang et al., 2015) including AG News, DBPedia, Yelp Reviews (with both 5-way classification and binary polarity labels), Amazon Reviews (with both 5-way classification and binary polarity labels), and Yahoo Answers. ...Split Tiny-ImageNet (Le & Yang, 2015). |
| Dataset Splits | No | The paper describes how subsets were created for comparison tasks or how many samples were drawn for distance calculation (e.g., "We randomly split MNIST and CIFAR10 to create subdataset pairs, each ranging in size from 5,000 to 10,000.", "We limit the target dataset to 100 examples per class,", "We randomly divide the Tiny ImageNet (Le & Yang, 2015) dataset into 10 disjoint tasks, each containing 20 classes."). However, it does not specify standard training, validation, and testing splits typically required for reproducing model training experiments. |
| Hardware Specification | Yes | For the runtime experiments and distance computations, we conducted tests using 8 CPU cores with 128GB of memory. For model training experiments, such as training BERT and ResNet, we used an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions specific models (BERT, ResNet-18, LeNet-5) and datasets but does not specify any software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or programming language versions with their respective version numbers. |
| Experiment Setup | Yes | We use a simplified LeNet-5, freezing the convolutional layers while fine-tuning the fully connected ones. ... We limit the target dataset to 100 examples per class, fine-tune BERT on the source domain... We initially train a ResNet-18 model on each task... freeze all layers except the final fully connected layer to fine-tune on the target task. ...augmentations for Tiny-ImageNet include random variations in brightness, contrast, saturation (0.1-0.9), and hue (0-0.5). For all experiments, we use s-OTDD with k = 5 and σ(Λ_k) being the product of k truncated Poisson distributions with rate parameters 1, ..., 5, respectively. |
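To make the distance being evaluated concrete, the sketch below shows a generic sliced 1-Wasserstein distance between two labeled datasets: each sample is represented as its feature vector concatenated with a one-hot label encoding, projected onto random directions, and compared via 1D quantile functions. This is an illustrative stand-in only; the paper's s-OTDD (Algorithm 1) uses a moment-based label embedding with orders drawn from the truncated Poisson distributions described above, which is not reproduced here. All function and parameter names are hypothetical.

```python
import numpy as np

def sliced_dataset_distance(X1, y1, X2, y2, n_projections=100,
                            n_quantiles=200, seed=0):
    """Illustrative sliced 1-Wasserstein distance between two labeled
    datasets. NOTE: this uses a simple one-hot label encoding as a
    stand-in for the paper's moment-based label embedding."""
    rng = np.random.default_rng(seed)
    n_classes = int(max(y1.max(), y2.max())) + 1
    # Concatenate features with one-hot labels so that the distance
    # reflects both feature and label distributions.
    Z1 = np.hstack([X1, np.eye(n_classes)[y1]])
    Z2 = np.hstack([X2, np.eye(n_classes)[y2]])
    d = Z1.shape[1]
    qs = np.linspace(0.0, 1.0, n_quantiles)
    total = 0.0
    for _ in range(n_projections):
        # Random direction on the unit sphere.
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        # 1D Wasserstein-1 between the projected empirical measures,
        # approximated by averaging |quantile difference| on a grid;
        # this also handles datasets of unequal size.
        q1 = np.quantile(Z1 @ theta, qs)
        q2 = np.quantile(Z2 @ theta, qs)
        total += np.mean(np.abs(q1 - q2))
    return total / n_projections
```

Because each projection reduces the problem to a 1D optimal transport computation (closed-form via sorting/quantiles), the cost scales near-linearly in dataset size, which is the source of the speedup the table's "Research Type" row refers to.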