Exploring Simple, High Quality Out-of-Distribution Detection with L2 Normalization
Authors: Jarrod Haas, William Yolland, Bernhard T Rabus
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results in Table 1 demonstrate that L2 normalization over feature space during training produces results that compare well with state-of-the-art methods. Notably, L2 normalization not only yields large performance gains over non-normalized baselines, but those gains also arrive much faster (Table 4). Training details for all experiments can be found in Appendix A.1. |
| Researcher Affiliation | Collaboration | Jarrod Haas EMAIL SARlab, Department of Engineering Science, Simon Fraser University; William Yolland EMAIL MetaOptima; Bernhard Rabus EMAIL SARlab, Department of Engineering Science, Simon Fraser University |
| Pseudocode | Yes | Algorithm 1: L2 Normalization of Features — def forward(self, x): z = self.encoder(x); feature_norm = torch.norm(z, dim=1).detach().clone(); z = torch.nn.functional.normalize(z, p=2, dim=1); y = self.fc(z); return y, feature_norm — Figure 1: A PyTorch code snippet illustrating the proposed method |
| Open Source Code | Yes | The Compact Convolutional Transformers were trained from five random initializations with cosine annealing for 300 epochs, in distributed data-parallel mode with batch sizes of 128. The code for these models and the training regime can be found at https://github.com/SHI-Labs/Compact-Transformers. |
| Open Datasets | Yes | AUROC scores for baselines trained on CIFAR10 and tested on far-OoD (SVHN) and near-OoD (CIFAR100) datasets... We study cases where feature norms, as a direct measure of input familiarity, can become more useful with L2 normalization. We show that this is at least the case for several architectures trained on the CIFAR10, CIFAR100 and TinyImageNet datasets. ...trained on the German Traffic Sign Recognition Benchmark (GTSRB) |
| Dataset Splits | Yes | To evaluate models, we merge ID and OoD images into a single test set. OoD performance is then a binary classification task, where we measure how well OoD images can be separated from ID images using a score derived from our model. Our score in this case is the L2 norm of each image's unnormalized feature vector z, which is input to AUROC and FPR95 scoring functions (see Figure 1). ...Plots of feature norms vs softmax scores for all CIFAR10 (blue) and SVHN (orange) test images. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are provided in the paper. The text only mentions 'wherever permitted by GPU RAM'. |
| Software Dependencies | No | The paper mentions a 'PyTorch code snippet', 'torch.nn.functional.normalize', an 'SGD optimizer', an 'AdamW optimizer', and a 'PyTorch implementation' of ConvNeXt, but no specific version numbers for any of these software components are provided. |
| Experiment Setup | Yes | Training employed an SGD optimizer initialized to a learning rate of 1e-1 with gamma=0.1, and with stepdowns at 40 and 50 epochs for 60-epoch models, 75 and 90 epochs for 100-epoch models, and at 150 and 250 epochs for 350-epoch models. All ResNet models use spectral normalization, global average pooling, and Leaky ReLUs... A batch size of 1024 was used wherever permitted by GPU RAM, but LogitNorm models were trained with a batch size of 128 as per the original paper's recommendations (Wei et al., 2022). ResNet50 used a batch size of 768 for CIFAR10 and 512 for TinyImageNet. The Compact Convolutional Transformers were trained from five random initializations with cosine annealing for 300 epochs, in distributed data-parallel mode with batch sizes of 128... It was trained using a single cosine annealing schedule with the AdamW optimizer. |
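The forward pass quoted in the Pseudocode row is short enough to restate outside the table. The sketch below is a NumPy stand-in for the paper's PyTorch snippet (the function name and epsilon guard are mine, not the paper's): keep each sample's raw feature norm as the OoD score, then pass unit-length features on to the classifier head.

```python
import numpy as np

def l2_normalize_features(z, eps=1e-12):
    """Row-wise L2 normalization of a feature batch.

    Mirrors the structure of the paper's forward pass: the
    unnormalized per-sample feature norms are retained (they serve
    as the OoD score), while the L2-normalized features are what a
    classifier head would consume. `eps` guards against division by
    zero and is an implementation detail added here.
    """
    norms = np.linalg.norm(z, axis=1)              # per-sample feature norm
    z_hat = z / np.maximum(norms, eps)[:, None]    # unit-length features
    return z_hat, norms

# Toy batch of two 2-d feature vectors.
z = np.array([[3.0, 4.0],
              [0.0, 2.0]])
z_hat, norms = l2_normalize_features(z)
# norms -> [5.0, 2.0]; every row of z_hat has unit L2 norm
```

In the PyTorch version this corresponds to `torch.norm(z, dim=1)` for the score and `torch.nn.functional.normalize(z, p=2, dim=1)` before the final linear layer.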
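The Dataset Splits row describes the evaluation as binary separation of ID from OoD test images using the feature-norm score, measured with AUROC (and FPR95). A minimal self-contained sketch of the AUROC part, under the assumption that higher norms indicate more ID-like inputs; the pairwise formulation below is my own illustration, not the paper's implementation:

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """AUROC for separating ID from OoD with a scalar score.

    Computed as the probability that a randomly drawn ID score
    exceeds a randomly drawn OoD score, counting ties as one half.
    Equivalent to the area under the ROC curve for the binary
    ID-vs-OoD task described in the paper's evaluation.
    """
    s_id = np.asarray(scores_id, dtype=float)
    s_ood = np.asarray(scores_ood, dtype=float)
    diff = s_id[:, None] - s_ood[None, :]          # all ID/OoD score pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Perfectly separated scores give AUROC = 1.0; overlap lowers it.
print(auroc([5.0, 4.0], [1.0, 2.0]))   # -> 1.0
```

In practice one would feed the L2 norms of the unnormalized feature vectors for the merged ID+OoD test set into this (or a library routine such as scikit-learn's `roc_auc_score`); FPR95 would additionally threshold the scores at 95% true-positive rate.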