Mahalanobis++: Improving OOD Detection via Feature Normalization
Authors: Maximilian Müller, Matthias Hein
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 44 models across diverse architectures and pretraining schemes show that ℓ2-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods. Code is available at github.com/mueller-mp/maha-norm. (...) 5. Experiments: ImageNet. Our main goal is to investigate the effectiveness of Mahalanobis++ across a large pool of architectures, model sizes and training schemes for ImageNet-scale OOD detection, as this is where the conventional Mahalanobis distance showed the most varied results in previous studies (...) We report the false positive rate at a true positive rate of 95% (FPR) as the OOD detection metric and refer to the appendix for other metrics, such as AUC, details on the model checkpoints, baseline methods, and extended results. |
| Researcher Affiliation | Academia | 1University of Tübingen and Tübingen AI Center. Correspondence to: Maximilian Müller <EMAIL>. |
| Pseudocode | No | The paper describes the methodology and evaluation steps using mathematical formulations and textual descriptions (e.g., Section 3.1 Mahalanobis Distance, equations 1-4) but does not include a distinct 'Pseudocode' or 'Algorithm' block with structured, code-like steps. |
| Open Source Code | Yes | Code is available at github.com/mueller-mp/maha-norm. |
| Open Datasets | Yes | Extensive experiments on 44 models across diverse architectures and pretraining schemes show that ℓ2-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods. Code is available at github.com/mueller-mp/maha-norm. (...) Following the OpenOOD setup (Yang et al., 2022), we report results on NINCO (Bitterwolf et al., 2023), iNaturalist (Van Horn et al., 2018), SSB-hard (Vaze et al., 2022), OpenImages-O (Krasin et al., 2017) and Texture (Cimpoi et al., 2014). (...) We investigate Mahalanobis++ on CIFAR100 (Krizhevsky, 2009), following the OpenOOD setup with TinyImageNet (Le & Yang, 2015), MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), Texture (Cimpoi et al., 2014), Places (Zhou et al., 2017) and CIFAR10 as OOD datasets for a range of architectures and training schemes. |
| Dataset Splits | Yes | Following the OpenOOD setup (Yang et al., 2022), we report results on NINCO (Bitterwolf et al., 2023), iNaturalist (Van Horn et al., 2018), SSB-hard (Vaze et al., 2022), OpenImages-O (Krasin et al., 2017) and Texture (Cimpoi et al., 2014). (...) If s_Maha(x_t) < T then the sample is rejected as OOD, where for evaluation purposes T is typically determined by fixing a TPR of 95% on the in-distribution. (...) Given the training set (x_i, y_i), i = 1, …, n, with input x_i and class labels y_i (...) OOD test samples (i.e. samples that were not used for estimating means and covariance). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU, CPU models, or memory specifications) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using 'publicly available model checkpoints from timm (Wightman, 2019) and huggingface.co' but does not specify version numbers for these or any other software dependencies (e.g., Python, PyTorch, CUDA) required for reproducibility. |
| Experiment Setup | Yes | If s_Maha(x_t) < T then the sample is rejected as OOD, where for evaluation purposes T is typically determined by fixing a TPR of 95% on the in-distribution. (...) Like suggested in (Sun et al., 2022), we use K = 1000. (...) As suggested in (Wang et al., 2022), we set the threshold r such that 1% of the activations from the train set would be truncated. (...) Like suggested by the authors, we use 1% of the train features and K = 10 neighbors for ImageNet experiments. (...) Like suggested in (Wang et al., 2022), we use D = 1000 if the dimensionality of the feature space d satisfies d ≥ 2048, D = 512 if 768 ≤ d < 2048, and D = d/2 rounded to an integer otherwise. |
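The recipe the table keeps quoting, ℓ2-normalizing features before fitting per-class means and a shared covariance, then thresholding the resulting Mahalanobis score at 95% TPR on the in-distribution, can be sketched as follows. This is an illustrative reimplementation based only on the excerpts above, not the authors' released code (see github.com/mueller-mp/maha-norm for that); all function and variable names here are our own.

```python
import numpy as np

def fit_mahalanobis(features, labels, l2_normalize=True):
    """Fit per-class means and a shared (pooled) covariance on train features.

    With l2_normalize=True this corresponds to the Mahalanobis++ idea of
    projecting features onto the unit sphere before the Gaussian fit.
    """
    if l2_normalize:
        features = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Center each sample by its own class mean, then pool one covariance.
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    precision = np.linalg.pinv(cov)
    return means, precision

def maha_score(x, means, precision, l2_normalize=True):
    """Negative minimal squared Mahalanobis distance to any class mean.

    Higher score = more in-distribution, matching the s_Maha(x_t) < T
    rejection rule quoted in the table.
    """
    if l2_normalize:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
    diffs = x[:, None, :] - means[None, :, :]            # shape (n, C, d)
    d2 = np.einsum('ncd,de,nce->nc', diffs, precision, diffs)
    return -d2.min(axis=1)

def fpr_at_95_tpr(scores_id, scores_ood):
    """FPR at 95% TPR: threshold so 95% of ID scores are accepted,
    then report the fraction of OOD scores that still pass."""
    thresh = np.quantile(scores_id, 0.05)
    return float((scores_ood >= thresh).mean())
```

Fitting on train features, scoring held-out ID and OOD features, and calling `fpr_at_95_tpr` reproduces the evaluation protocol the table describes; swapping `l2_normalize=False` recovers the conventional Mahalanobis baseline the paper compares against.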