Distributionally Robust Coreset Selection under Covariate Shift
Authors: Tomonari Tanaka, Hiroyuki Hanada, Hanting Yang, Tatsuya Aoyama, Yu Inatsu, Satoshi Akahane, Yoshito Okura, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we focus on covariate shift, a type of data distribution shift, and demonstrate the effectiveness of DRCS through experiments. |
| Researcher Affiliation | Collaboration | Tomonari Tanaka EMAIL Nagoya University Hiroyuki Hanada EMAIL Nagoya University RIKEN Hanting Yang EMAIL Nagoya University Tatsuya Aoyama EMAIL Nagoya University Yu Inatsu EMAIL Nagoya Institute of Technology Satoshi Akahane EMAIL Nagoya University Yoshito Okura EMAIL Nagoya University Noriaki Hashimoto EMAIL RIKEN Taro Murayama EMAIL DENSO CORPORATION Hanju Lee EMAIL DENSO CORPORATION Shinya Kojima EMAIL DENSO CORPORATION Ichiro Takeuchi EMAIL Nagoya University RIKEN |
| Pseudocode | Yes | The details of these algorithms are given in Appendix C.4. In this appendix, we provide their pseudocode. ... Algorithm 1 Distributionally Robust Coreset Selection for Small Datasets ... Algorithm 2 Distributionally Robust Coreset Selection for Large Datasets ... Algorithm 3 Distributionally Robust Coreset Selection for Large Datasets |
| Open Source Code | No | The paper lists libraries used like NumPy (Harris et al., 2020), CVXPY (Diamond & Boyd, 2016), SciPy (Virtanen et al., 2020), PyTorch (Paszke et al., 2017), neural-tangents (Novak et al., 2020), and scikit-learn (Pedregosa et al., 2011). However, it does not contain an explicit statement by the authors that they are releasing their own source code for the methodology described in the paper, nor does it provide a direct link to a code repository for their implementation. |
| Open Datasets | Yes | The datasets are primarily designed for binary classification, and for multi-class datasets, two classes are extracted and used. ... The datasets listed in Table 1. ... All of the datasets are downloaded from LIBSVM dataset (Chang & Lin, 2011). We used training datasets only if test datasets are provided separately (splice). ... In this paper, we evaluate the DRCS method using the CIFAR10 dataset (Krizhevsky, 2009). |
| Dataset Splits | Yes | We perform cross-validation with a training-to-test data ratio of 4:1 in the experiments. ... CIFAR10 consists of 50,000 training instances and 10,000 test instances, divided into 10 categories. Since the proposed method is designed for binary classification, we extract a subset of the CIFAR10 dataset. Specifically, we create a binary classification dataset consisting of 40,000 training instances categorized as vehicles (airplane, automobile, ship, truck) and animals (cat, deer, dog, horse). The test dataset is similarly divided into two classes, resulting in a total of 8,000 test instances. |
| Hardware Specification | Yes | We used the following computers for experiments: For experiments except for the image dataset, we run experiments on a computer with Intel Xeon Silver 4214R (2.40GHz) CPU and 64GB RAM. For experiments using the image dataset, we run experiments on a computer with Intel(R) Xeon(R) Gold 6338 (2.00GHz) CPU, NVIDIA RTX A6000 GPU and 1TB RAM. |
| Software Dependencies | No | The paper lists several libraries (NumPy, CVXPY, SciPy, PyTorch, neural-tangents, scikit-learn) along with citations including publication years. However, it does not provide specific version numbers for these software components as used in the experiments (e.g., 'PyTorch 1.9', 'NumPy 1.20'), only the publication year of the paper describing the library. |
| Experiment Setup | Yes | Specifically, for all experiments, we use a batch size of 128, a learning rate of 0.01, weight decay of 0.001, and train the model using the Adam optimizer for 100 epochs. ... The choice of the regularization hyperparameter λ, based on the characteristics of the data, is as follows: We set λ as n, n·10^{−1.5}, n·10^{−3.0}, and the best λ, which is decided by cross-validation. ... The choice of the hyperparameter in the RBF kernel is fixed as follows: we set ζ = 1/(d·V(Z)) as suggested in sklearn.svm.SVC of scikit-learn (Pedregosa et al., 2011), where V denotes the elementwise sample variance. |
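The binary CIFAR10 split quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard CIFAR10 class indexing (airplane=0, automobile=1, ship=8, truck=9 as vehicles; cat=3, deer=4, dog=5, horse=7 as animals; bird and frog are dropped), and uses synthetic stand-in labels rather than downloading the dataset.

```python
import numpy as np

# Hypothetical sketch of the vehicles-vs-animals binary split of CIFAR10
# described in the report. Class indices follow the standard CIFAR10 ordering;
# bird (2) and frog (6) are discarded, leaving 8 of the 10 classes.
VEHICLES = {0, 1, 8, 9}   # airplane, automobile, ship, truck
ANIMALS = {3, 4, 5, 7}    # cat, deer, dog, horse

def to_binary_subset(labels: np.ndarray):
    """Return (keep_indices, binary_labels): vehicles -> +1, animals -> -1."""
    keep = np.where(np.isin(labels, list(VEHICLES | ANIMALS)))[0]
    y = np.where(np.isin(labels[keep], list(VEHICLES)), 1, -1)
    return keep, y

# Stand-in labels: 5,000 training instances per class, matching CIFAR10's
# training set; the same mapping applied to the 10,000 test labels yields
# the 8,000-instance binary test set quoted above.
train_labels = np.repeat(np.arange(10), 5000)
keep, y = to_binary_subset(train_labels)
print(len(keep))       # -> 40000 retained training instances
print((y == 1).sum())  # -> 20000 vehicle instances
```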
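The hyperparameter choices quoted under Experiment Setup can also be sketched. The symbols here are assumptions inferred from the quote: n is the training-set size, d the feature dimension, and V(Z) the elementwise sample variance of the feature matrix Z; the ζ formula is read as the `gamma='scale'` convention of sklearn.svm.SVC, i.e. 1/(d·V(Z)).

```python
import numpy as np

# Toy feature matrix standing in for the training data Z (n=400, d=20).
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 20))
n, d = Z.shape

# Regularization grid as quoted: lambda in {n, n*10^-1.5, n*10^-3.0};
# the best value would then be chosen by cross-validation.
lambdas = [n, n * 10 ** -1.5, n * 10 ** -3.0]

# RBF kernel scale zeta = 1 / (d * V(Z)), following scikit-learn's
# gamma='scale' default for sklearn.svm.SVC (an assumption about the
# garbled formula in the quote).
zeta = 1.0 / (d * Z.var())
print(lambdas)
print(zeta)
```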