Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift

Authors: Renchunzi Xie, Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko, Jianfeng Zhang, Bo An

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments conducted with various architectures on diverse distribution shifts demonstrate that our method significantly outperforms current state-of-the-art approaches. The code is available at https://github.com/Renchunzi-Xie/GdScore." ... (Section 5: Experiments)
Researcher Affiliation | Collaboration | Renchunzi Xie EMAIL College of Computing and Data Science, Nanyang Technological University; Ambroise Odonnat EMAIL Huawei Noah's Ark Lab, Inria Paris, France
Pseudocode | Yes | "B. Pseudo-code of GdScore. Our proposed GdScore for unsupervised accuracy estimation can be calculated as shown in Algorithm 1." (Algorithm 1: Unsupervised Accuracy Estimation via GdScore)
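The pseudocode entry above refers to the paper's Algorithm 1 for GdScore, which estimates a model's accuracy on unlabeled shifted data from gradient information. As a rough illustration of the general idea only — scoring a batch by the gradient norm of the final linear layer under cross-entropy with the model's own pseudo-labels — here is a minimal sketch. The function name `gradient_norm_score`, the argmax pseudo-labels, and the plain Frobenius norm are illustrative assumptions; this is not the paper's exact Algorithm 1 (which, for instance, involves a parameter p = 0.3 not modeled here).

```python
import numpy as np

def softmax(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def gradient_norm_score(features, W, b):
    """Hypothetical gradient-based score for an unlabeled batch.

    Computes the Frobenius norm of the analytic gradient of the mean
    cross-entropy loss w.r.t. the final linear layer's weights W, using
    the model's own argmax predictions as pseudo-labels. A confident
    model (probs close to one-hot) yields a small gradient norm.
    """
    logits = features @ W + b
    probs = softmax(logits)
    pseudo = probs.argmax(axis=1)
    onehot = np.eye(W.shape[1])[pseudo]
    # d(mean CE)/dW = features^T (probs - onehot) / batch_size
    grad_W = features.T @ (probs - onehot) / len(features)
    return float(np.linalg.norm(grad_W))
```

Under this sketch, sharper (more confident) predictions produce a smaller score, which is the kind of signal a gradient-based accuracy estimator can correlate with test accuracy.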
Open Source Code | Yes | "The code is available at https://github.com/Renchunzi-Xie/GdScore."
Open Datasets | Yes | "For pre-training the neural network, we use CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), ImageNet (Deng et al., 2009), Office-31 (Saenko et al., 2010), Office-Home (Venkateswara et al., 2017), Camelyon17-WILDS (Koh et al., 2021), and BREEDS (Santurkar et al., 2020)" ... "we use CIFAR-10C, CIFAR-100C, and ImageNet-C (Hendrycks & Dietterich, 2019)" ... "Tiny ImageNet-C (Hendrycks & Dietterich, 2019)"
Dataset Splits | No | The paper mentions using specific datasets like CIFAR-10C, CIFAR-100C, and ImageNet-C, which span 19 types of corruption across 5 severity levels, and Tiny ImageNet-C, with 15 types of corruption and 5 severity levels. It also refers to ...
Hardware Specification | No | The paper states: "To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models." However, no specific hardware (e.g., GPU, CPU models, or memory) used for these experiments is mentioned.
Software Dependencies | No | The paper mentions "SGD with a learning rate of 10^-3, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128." These are algorithmic parameters. No specific software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x) are provided, which would be necessary for reproducibility.
Experiment Setup | Yes | "Training details. To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models. We train them for 20 epochs for CIFAR-10 (Krizhevsky & Hinton, 2009) and 50 epochs for the other datasets. In all cases, we use SGD with a learning rate of 10^-3, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128. For all experiments, we used p = 0.3 to compute GdScore."
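The reported optimization setup uses cosine learning-rate decay (Loshchilov & Hutter, 2016) from a base rate of 10^-3. A minimal sketch of that schedule, assuming decay to zero over the full training run with no warm restarts (the function name `cosine_lr` is illustrative, not from the paper):

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1e-3):
    """Cosine-annealed learning rate: base_lr at epoch 0, zero at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

With the paper's settings (base rate 10^-3, 50 epochs), the rate starts at 1e-3, passes through 5e-4 at the halfway point, and reaches zero at epoch 50.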