Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises

Authors: Zirun Guo, Tao Jin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two public datasets show the effectiveness and superiority over existing methods under the complex noise patterns in multimodal data. Code is available at https://github.com/zrguo/SuMi.
Researcher Affiliation | Academia | Zirun Guo, Tao Jin; Zhejiang University
Pseudocode | Yes | Algorithm 1: SuMi
Open Source Code | Yes | Code is available at https://github.com/zrguo/SuMi.
Open Datasets | Yes | Datasets. We use two widely used multimodal datasets, Kinetics50 (Kay et al., 2017) and VGGSound (Chen et al., 2020) for evaluation. Following previous work (Hendrycks & Dietterich, 2019; Yang et al., 2024), we introduce 15 different types of corruptions and 6 types for audio to simulate the distribution shifts in real-world applications.
Dataset Splits | Yes | Following Yang et al. (2024), we use a subset of Kinetics which consists of 50 classes, 29,204 training pairs and 2,466 test pairs.
Hardware Specification | No | The paper does not provide specific hardware details for running its experiments. It mentions using a pre-trained model and an optimizer but gives no information about GPUs, CPUs, or other computing resources.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers. It mentions using the Adam optimizer and the pre-trained CAV-MAE model, but gives no versions for frameworks like PyTorch or TensorFlow, or for other libraries.
Experiment Setup | Yes | We use Adam optimizer with a learning rate of 1e-4/1e-5 and batch size of 16/64 for Kinetics50-C and VGGSound-C, respectively. The multimodal threshold γ_m in Equation 4 and the normalization factor Ent_0 in Equation 7 are set to 0.4 ln C following Niu et al. (2022) by default, where C is the number of task classes. The unimodal threshold γ_u in Equation 4 is set to e^{-1} by default. The smoothing coefficient β is set to 0.6/0.9, the weighting term λ is set to 5.0 and the unimodal assistance μ is set to 1.0 by default for Kinetics50-C and VGGSound-C. For strong OOD adaptation, we set the mutual information sharing term t_0 as iter/2. Following previous work (Niu et al., 2023; Gong et al., 2023a; Chen et al., 2024; Guo et al., 2024b), we update the affine parameters of normalization layers.
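
The reported Kinetics50-C settings can be gathered into a small configuration sketch, which also makes the 0.4 ln C threshold concrete for C = 50. This is only an illustration: the class name `AdaptationConfig`, its field names, and the `shannon_entropy` helper are hypothetical and are not taken from the SuMi repository, and the selection rule itself (Equation 4 of the paper) is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    """Hypothetical container for the reported hyperparameters (names are illustrative)."""
    num_classes: int            # C, the number of task classes
    learning_rate: float        # Adam learning rate
    batch_size: int
    beta: float                 # smoothing coefficient
    lam: float = 5.0            # weighting term (lambda)
    mu: float = 1.0             # unimodal assistance
    threshold_scale: float = 0.4  # factor in 0.4 * ln C

    @property
    def gamma_m(self) -> float:
        # Multimodal threshold: 0.4 * ln C
        return self.threshold_scale * math.log(self.num_classes)

    @property
    def ent0(self) -> float:
        # Normalization factor, reported as the same 0.4 * ln C value
        return self.gamma_m


def shannon_entropy(probs):
    """Entropy (in nats) of a discrete distribution; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Values reported for Kinetics50-C (50 classes, lr 1e-4, batch size 16, beta 0.6).
KINETICS50_C = AdaptationConfig(
    num_classes=50, learning_rate=1e-4, batch_size=16, beta=0.6
)

if __name__ == "__main__":
    print(f"gamma_m = {KINETICS50_C.gamma_m:.3f}")  # 0.4 * ln 50, about 1.565
    uniform = [1 / 50] * 50
    print(f"entropy of a uniform prediction = {shannon_entropy(uniform):.3f}")  # ln 50
```

For scale, a uniform prediction over 50 classes has entropy ln 50 ≈ 3.912, well above the threshold of about 1.565, while a one-hot prediction has entropy 0.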