TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality

Authors: Yinsong Wang, Shahin Shahrampour

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.' 'We provide detailed simulations on three real-world datasets in different domains to examine various aspects of TAP and demonstrate that the integration of TAP into a neural network can provide statistically significant improvement in generalization using the unlabeled modality. We also provide detailed ablation studies to investigate the best configuration for TAP in practice, including the choice of kernel, the choice for latent space transformation, and compatibility with CNN and Transformer-based backbone feature extractors with an additional text-image dataset.'
Researcher Affiliation | Academia | Yinsong Wang (EMAIL), Department of Mechanical and Industrial Engineering, Northeastern University; Shahin Shahrampour (EMAIL), Department of Mechanical and Industrial Engineering, Northeastern University.
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It describes its methods through mathematical formulations and descriptive text, accompanied by figures visualizing the architecture.
Open Source Code | No | The paper states that 'Full implementation details for all experiments in this section can be found in the Appendix for reproducibility' but provides no concrete access to source code, such as a repository link or an explicit statement of code release.
Open Datasets | Yes | 'Datasets: To ensure a comprehensive evaluation of the performance of TAP integration, we select/create three real-world cross-modal datasets in three different areas. All datasets are open-access and can be found online. A detailed dataset and pre-processing description can be found in the Appendix.' Computer Vision: the MNIST dataset (MNIST) (Deng, 2012). Healthcare: the Activity dataset (Activity) (Mohino-Herranz et al., 2019), where Electrodermal Activity (EDA) signals are the primary modality X for predicting subject activity. Remote Sensing: the Crop dataset (Crop) (Khosravi et al., 2018; Khosravi & Alavipanah, 2019). The paper also carries out a test on a fourth dataset, the Memotion 7K dataset (Sharma et al., 2020).
Dataset Splits | Yes | Similar to semi-supervised learning, the motivation behind utilizing unlabeled data points is the limited availability of labeled data. So, for each dataset, 200 data points in the primary modality are randomly sampled to serve as the training data, and 1000 data points in the secondary modality are randomly sampled to serve as the cross-modal reference data. For MNIST, this means 200 upper-half images as primary-modality training data and 1000 lower-half images as reference data. All remaining data points serve as the evaluation data. The Memotion 7K dataset requires more training data to learn a model, so 5000 data points are sampled as the training set, 1000 as the reference data, and the rest as the evaluation data. At each Monte Carlo simulation, the training, reference, and evaluation sets are reshuffled while keeping their sizes the same.
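The splitting procedure described in that row can be sketched as follows. This is a hypothetical reconstruction, not code from the paper; the function name, seeding, and index-based interface are my own assumptions.

```python
import numpy as np

def split_indices(n_primary, n_secondary, n_train=200, n_ref=1000, seed=0):
    """Randomly partition dataset indices into training / reference /
    evaluation sets, mirroring the splits described in the paper:
    n_train labeled primary-modality points, n_ref unlabeled
    secondary-modality reference points, and the remaining primary
    points held out for evaluation."""
    rng = np.random.default_rng(seed)
    primary = rng.permutation(n_primary)      # shuffle primary-modality indices
    secondary = rng.permutation(n_secondary)  # shuffle secondary-modality indices
    train = primary[:n_train]                 # labeled training data
    ref = secondary[:n_ref]                   # cross-modal reference data
    evaluation = primary[n_train:]            # all remaining points for evaluation
    return train, ref, evaluation
```

Re-calling the function with a different seed at each Monte Carlo simulation reproduces the reshuffling while keeping the split sizes fixed; for Memotion 7K one would pass `n_train=5000` instead.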
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. It mentions the memory cost of the model but not the underlying hardware.
Software Dependencies | No | The paper mentions the use of PyTorch but does not specify a version for it or any other software dependency. It also refers to pre-trained models such as EfficientNet-B0 and distilled RoBERTa without their specific versions.
Experiment Setup | Yes | 'In the performance evaluation, the reference batch size for TAP is chosen as 250, which is 1.25 times the training data. The training data batch size is set to 100... The backbone neural network structure for all three datasets is a two-hidden-layer neural network with 64 hidden neurons at each layer. The activation function is ReLU with a dropout rate of 0.5. Layer normalization is implemented after each hidden layer... All models are trained with the cross-entropy loss using the Adam optimizer with a fixed learning rate of 0.0001. All models are trained for 1000 epochs (8000 in the Crop dataset) except for TAP... The learning rate is set to 0.00002 for both baseline and TAP integration. Each model is trained for 20 epochs, where we observe the validation accuracy stabilizes.'
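The backbone described in that row can be sketched in PyTorch as below. This is a hedged reconstruction from the quoted hyperparameters only: the class name, the input/output dimensions, and the exact ordering of dropout and layer normalization within each hidden block are my own assumptions, not stated in the paper.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Hypothetical sketch of the baseline backbone: two hidden layers
    of 64 units each, ReLU activations, dropout rate 0.5, and layer
    normalization after each hidden layer."""

    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(0.5), nn.LayerNorm(64),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5), nn.LayerNorm(64),
            nn.Linear(64, n_classes),  # logits for cross-entropy loss
        )

    def forward(self, x):
        return self.net(x)

# Example dimensions (assumed): upper-half MNIST images flattened to 28 * 14 = 392
model = Backbone(in_dim=392, n_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed lr 0.0001
loss_fn = nn.CrossEntropyLoss()
```

The training batch size of 100 and epoch counts from the quote would then drive a standard training loop over this model; the TAP module itself, which consumes the 250-sample reference batches, is not reconstructed here.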