Unsupervised Domain Adaptation by Learning Using Privileged Information

Authors: Adam Breitholtz, Anton Matsson, Fredrik D. Johansson

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the empirical benefits of learning using privileged information, compared to the other data availability settings in Table 1, across four UDA image classification tasks where PI is available in the forms described in Section 3. Widely used datasets for UDA evaluation like OfficeHome (Venkateswara et al., 2017) and large-scale benchmark suites like DomainBed (Gulrajani & Lopez-Paz, 2021), VisDA (Peng et al., 2017) and WILDS (Koh et al., 2021) do not include privileged information and cannot be used for evaluation here. Thus, we first compare our method to baselines on the recent CelebA task (Xie et al., 2020) which includes PI in the form of binary attributes (Section 4.1). Additionally, we propose three new tasks based on well-known image classification datasets with regions of interest as PI (Sections 4.2-4.4). In Sections 4.1 and 4.2, we use the two-stage estimator with the subnetwork f̂ based on the ResNet-18 architecture (He et al., 2016a). In Sections 4.3 and 4.4, we use our variant of Faster R-CNN with a ResNet-50 backbone.
Researcher Affiliation Academia Adam Breitholtz (EMAIL), Anton Matsson (EMAIL), Fredrik D. Johansson (EMAIL); Department of Computer Science, Chalmers University of Technology
Pseudocode Yes Algorithm 1 Training of the two-stage model.
1: procedure TWO_STAGE({(x_i, w_i, t_i, y_i)})
2:   Empirically minimize (1/m) Σ_{i=1}^m ||d(x_i) - t_i||^2 and obtain d̂.
3:   Empirically minimize (1/n) Σ_{i=1}^n CCE(g(w_i), y_i) and obtain ĝ.
4:   Compose d̂, ĝ, and φ into ĥ(x) = ĝ(φ(x, d̂(x))).
5: end procedure
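The two stages above can be sketched with simple stand-in models. This is a minimal illustration, not the paper's implementation: the data are synthetic, d is linear (so stage 1 reduces to least squares), g is a small multinomial logistic regression fit by gradient descent, and φ is assumed to simply pass d̂(x) through to ĝ. All shapes and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (shapes are assumptions): x = inputs, t = privileged
# information targets for stage 1; w, y = inputs and labels for stage 2.
m, n, dx, dt = 200, 150, 8, 3
X = rng.normal(size=(m, dx))
T = X @ rng.normal(size=(dx, dt)) + 0.01 * rng.normal(size=(m, dt))

# Stage 1: empirically minimize (1/m) * sum ||d(x_i) - t_i||^2 to obtain d_hat.
# With a linear d, the minimizer is ordinary least squares.
D_hat, *_ = np.linalg.lstsq(X, T, rcond=None)

def d_hat(x):
    return x @ D_hat

# Stage 2: empirically minimize the mean cross-entropy CCE(g(w_i), y_i).
# A two-class logistic regression trained by gradient descent stands in for g.
W = rng.normal(size=(n, dt))
y = (W[:, 0] > 0).astype(int)      # synthetic binary labels
Y = np.eye(2)[y]
G = np.zeros((dt, 2))
for _ in range(500):
    P = np.exp(W @ G)
    P /= P.sum(axis=1, keepdims=True)
    G -= 0.1 * W.T @ (P - Y) / n   # gradient of the mean cross-entropy

def g_hat(w):
    return (w @ G).argmax(axis=1)

# Compose: h_hat(x) = g_hat(phi(x, d_hat(x))); phi is identity on d_hat(x) here.
def h_hat(x):
    return g_hat(d_hat(x))

preds = h_hat(X)
print(preds.shape)
```

The key design point the sketch preserves is that the two minimizations are independent: d̂ is fit on (x, t) pairs and ĝ on (w, y) pairs, and only the final composition ĥ ties them together.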
Open Source Code Yes All models were trained using NVIDIA Tesla A40 GPUs and the development and evaluation of this study required approximately 30,000 hours of GPU training. The code is available on GitHub: https://github.com/Healthy-AI/dalupi.
Open Datasets Yes We use the ChestX-ray8 dataset (Wang et al., 2017) as source domain and the CheXpert dataset (Irvin et al., 2019) as target domain. As PI, we use the regions of pixels associated with each found pathology, as annotated by domain experts using bounding boxes. For the CheXpert dataset, only pixel-level segmentations are available, and we create bounding boxes that tightly enclose the segmentations.
Dataset Splits Yes We use a subset of the CelebA dataset with 2,000 labeled source examples and 3,000 unlabeled target examples. We use 1,000 samples each for the source validation set, source test set, and target test set. The target oracle, SL-T, is trained using labels provided for the 3,000 target examples, with 20% of these examples set aside for validation. The same unlabeled validation set is used to validate the first DALUPI network, f̂.
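The split sizes described above can be mirrored with a simple index partition. This is an illustrative sketch only: the pool sizes, the assumption that the source validation/test sets come from a separate 4,000-example source pool, and all variable names are assumptions, not the paper's actual split code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical index pools (sizes assumed to make the described splits disjoint).
source_idx = rng.permutation(4000)   # source pool: train + validation + test
target_idx = rng.permutation(3000)   # the 3,000 target examples

source_train = source_idx[:2000]     # 2,000 labeled source examples
source_val   = source_idx[2000:3000] # 1,000 source validation samples
source_test  = source_idx[3000:]     # 1,000 source test samples

# SL-T oracle: trained on labeled target data, with 20% held out for validation.
n_val = int(0.2 * len(target_idx))   # 600 examples
slt_val   = target_idx[:n_val]
slt_train = target_idx[n_val:]

print(len(source_train), len(source_val), len(source_test), len(slt_val))
```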
Hardware Specification Yes All models were trained using NVIDIA Tesla A40 GPUs and the development and evaluation of this study required approximately 30,000 hours of GPU training.
Software Dependencies No The paper mentions software like Python, PyTorch, skorch, torchvision, TensorFlow, and ADAPT, but does not provide specific version numbers for these components. For example, it says "skorch (Tietz et al., 2017)" but not "skorch X.Y.Z".
Experiment Setup Yes For each task and task-specific setting (label skew, amount of privileged information, etc.), we train 10 models from each relevant class using hyperparameters randomly selected from given ranges (see Appendix A). For DANN and MDD, the trade-off parameter, which regularizes domain discrepancy in representation space, increases from 0 to 0.1 during training; for MDD, the margin parameter is set to 3. All models are evaluated on a held-out validation set from the source domain and the best-performing model in each class is then evaluated on held-out test sets from both domains. For SL-T, we use a held-out validation set from the target domain. We repeat this procedure over 5 or 10 seeds, controlling the data splits and the random number generation. We report accuracy and area under the ROC curve (AUC) with 95 % confidence intervals computed by bootstrapping over the seeds.
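The reported confidence intervals, computed by bootstrapping the per-seed metrics, can be sketched as follows. The accuracy values below are made up for illustration; they are not results from the paper.

```python
import numpy as np

# Hypothetical per-seed accuracies for one model class (e.g. 10 seeds).
acc = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79])

rng = np.random.default_rng(42)
B = 10_000

# Resample the seeds with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(acc, size=acc.size, replace=True).mean() for _ in range(B)
])

# 95% percentile interval over the bootstrap distribution of the mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={acc.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Bootstrapping over seeds (rather than over test examples) captures run-to-run variability from initialization and data splits, which is the quantity the repeated-seed protocol above is designed to measure.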