reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

TAROT: Targeted Data Selection via Optimal Transport

Authors: Lan Feng, Fan Nie, Yuejiang Liu, Alexandre Alahi

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across various deep learning tasks.
Researcher Affiliation	Academia	1EPFL, Switzerland 2Stanford, USA. Correspondence to: Yuejiang Liu <EMAIL>, Alexandre Alahi <EMAIL>.
Pseudocode	Yes	Algorithm 1: Fixed-Size Selection; Algorithm 2: OT-Distance Minimization Selection (OTM)
Open Source Code	Yes	Code is available at: https://github.com/vita-epfl/TAROT.
Open Datasets	Yes	Following Park et al. (2023), we evaluate image classification using ResNet-9 classifiers trained on the CIFAR-10 dataset. For motion prediction, we adopt AutoBots (Girgis et al., 2021), training on the nuScenes (Caesar et al., 2020) dataset (32k samples) and validating on 9k target samples. The GTA5 dataset (Richter et al., 2016) serves as the candidate dataset, while the Cityscapes (Cordts et al., 2016) training split (2975 samples) is used as the target dataset, with its validation split for evaluation. We use the UniTraj framework (Feng et al., 2024) for unified training and evaluation across multiple datasets, including Waymo Open Motion (WOMD) (Ettinger et al., 2021), Argoverse 2 (Wilson et al., 2021), nuScenes (Caesar et al., 2020), and nuPlan (H. Caesar, 2021). We utilize the same candidate dataset, comprising FLAN V2 (Longpre et al., 2023), COT (Wei et al., 2022), DOLLY (Conover et al., 2023) and OPEN ASSISTANT 1 (Köpf et al., 2024), with MMLU (Hendrycks et al., 2021b;a) and BBH (Suzgun et al., 2023) serving as the target tasks for evaluation.
Dataset Splits	Yes	The Cityscapes (Cordts et al., 2016) training split (2975 samples) is used as the target dataset, with its validation split for evaluation. The nuScenes training set (32k samples) serves as the target dataset, while the candidate pool comprises WOMD, Argoverse 2, and nuPlan. From nuPlan, we filter trajectories with a moving distance over 2 meters, yielding 1000k samples. We use the official training splits of WOMD and Argoverse 2, including 2000k samples. The evaluation is conducted on the nuScenes validation set. We evaluate selection ratios of 5%, 20%, 50% and OTM (Section 3.3). OTM selects approximately 24% of the data. OT-Distance Minimization Selection (OTM): The target dataset Dt is randomly split into k equal subsets. In each fold, 1/k of Dt is used for selection, while the OT distance is evaluated against the remaining (k - 1)/k data. ... In our experiments, k = 10 ensures a good match with the target distribution while avoiding overfitting.
Hardware Specification	Yes	Table 8: The wall clock runtime (measured as single H100 GPU hours) of TAROT compared with LESS and TSDS on instruction tuning task.
Software Dependencies	No	The paper mentions specific models and frameworks used (DeepLabV3, ResNet50, AutoBots, Wayformer, LLAMA-3.1-8B, QWEN-2.5-7B) but does not provide specific version numbers for underlying software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup	Yes	Table 3: Training Hyperparameters for Semantic Segmentation; Table 4: Experiment Settings for Motion Prediction; Table 5: Training Hyperparameters for Instruction Tuning
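
The k-fold target split quoted in the Dataset Splits row can be illustrated with a short sketch. This is not the authors' code: the function name `kfold_target_splits`, the `seed` parameter, and the handling of target sizes that do not divide evenly by k are assumptions made for illustration. It only builds the index splits (1/k of the target for selection, the remaining (k - 1)/k held out for measuring OT distance) and does not implement the selection or the OT computation itself.

```python
import random


def kfold_target_splits(n_target, k=10, seed=0):
    """Return k (selection_idx, holdout_idx) pairs over range(n_target).

    Hypothetical helper: in each fold, one subset (about 1/k of the target
    set) is used for selection and the other k - 1 subsets are held out.
    """
    indices = list(range(n_target))
    random.Random(seed).shuffle(indices)  # random split, reproducible via seed

    # Near-equal folds (assumption): the first n_target % k folds get one extra sample.
    base, extra = divmod(n_target, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    splits = []
    for i in range(k):
        selection = folds[i]
        holdout = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((selection, holdout))
    return splits


# Example with the Cityscapes target size quoted above (2975 samples), k = 10.
splits = kfold_target_splits(2975, k=10)
selection, holdout = splits[0]
print(len(splits), len(selection), len(holdout))
```

With k = 10, each fold uses roughly 10% of the target set for selection and the remaining 90% for evaluating how well the selected data matches the target distribution.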