Improving Out-of-Distribution Detection with Markov Logic Networks
Authors: Konstantin Kirchheim, Frank Ortmeier
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Magdeburg, Germany. Correspondence to: Konstantin Kirchheim <EMAIL>. |
| Pseudocode | Yes | Alg. 1 provides an overview of the score computation, which will be described in the following. [...] To address this, we adopt a greedy search strategy, described in Alg. 2. |
| Open Source Code | Yes | The source code for our experiments is available online: https://github.com/kkirchheim/mln-ood |
| Open Datasets | Yes | The German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2012) contains approximately 40,000 images of German traffic signs spanning 43 classes. [...] The CelebA dataset (Liu et al., 2015) comprises approximately 200,000 images with 40 binary attribute annotations, covering concepts such as gender, age, the presence of facial hair, and more. [...] As OOD test data, we use images from 8 different sources that cover near and far OOD data, including cropped and resized variants of the LSUN (Yu et al., 2015) and the Tiny ImageNet datasets, Gaussian and Uniform Noise, Places365 (Zhou et al., 2017), and iNaturalist (Van Horn et al., 2018). |
| Dataset Splits | Yes | The original training set is split into 35,000 images for training, and 4,209 images for validation, with all pictures resized to 32 x 32. [...] We split the data into 150,000, 2,599, and 50,000 images for training, validation, and testing, respectively. |
| Hardware Specification | Yes | Fig. 5 depicts inference time and batch size on the GTSRB, averaged over 100 batches on an Nvidia A100. |
| Software Dependencies | No | We implement a compiler that transforms constraints formulated in a human-understandable format, such as class=stop sign -> color=red and shape=octagon, into PyTorch operations. The paper mentions "PyTorch operations" but does not provide specific version numbers for PyTorch or any other key software libraries used. |
| Experiment Setup | Yes | The DNNs, which were pre-trained on a downscaled variant of ImageNet, are then further trained for ten epochs using mini-batch SGD with a Nesterov momentum of 0.9, an initial learning rate of 0.01 with a cosine annealing schedule (Loshchilov & Hutter, 2017), and a batch size of 32. [...] The MLN's parameters w, which we initialize with 1, are optimized by minimizing the negative log-likelihood... using the L-BFGS optimizer for 10 epochs with a learning rate of 0.01. |
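The Software Dependencies row quotes a compiler that turns rules such as `class=stop sign -> color=red and shape=octagon` into PyTorch operations. The paper excerpt above does not spell out the exact semantics, so the following is a minimal sketch under the assumption of product fuzzy logic, written with plain Python floats (the same formulas apply elementwise to PyTorch tensors); all function names are illustrative and not taken from the authors' code.

```python
def soft_and(p, q):
    # Product t-norm: soft conjunction of two truth values in [0, 1].
    return p * q

def soft_implies(p, q):
    # Reichenbach implication 1 - p + p*q: evaluates to 1 whenever
    # the premise is false (p = 0) or the conclusion is true (q = 1).
    return 1.0 - p + p * q

def stop_sign_rule(p_stop_sign, p_red, p_octagon):
    # class=stop_sign -> (color=red AND shape=octagon)
    return soft_implies(p_stop_sign, soft_and(p_red, p_octagon))

# Consistent prediction: confident stop sign that is red and octagonal.
consistent = stop_sign_rule(0.9, 0.95, 0.9)    # high satisfaction
# Inconsistent prediction: confident stop sign, but blue and round.
inconsistent = stop_sign_rule(0.9, 0.1, 0.1)   # low satisfaction
```

Under this reading, a low satisfaction score signals a prediction that violates the constraint, which is the kind of signal an OOD detector can exploit.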
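The Experiment Setup row pins down most optimizer hyperparameters. As a reading aid, here is how that configuration maps onto PyTorch; this is a configuration sketch only, where `model` and `num_rules` are placeholders for the actual pre-trained DNN and the number of MLN rules, and the training loops and loss functions are omitted.

```python
import torch
from torch import nn

# Placeholders standing in for the pre-trained backbone and rule count.
model = nn.Linear(8, 4)
num_rules = 16

# Backbone fine-tuning: 10 epochs of mini-batch SGD (batch size 32)
# with Nesterov momentum 0.9 and a cosine-annealed learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# MLN weights w, initialized to 1, fitted by minimizing the negative
# log-likelihood with L-BFGS for 10 epochs at learning rate 0.01.
mln_weights = torch.ones(num_rules, requires_grad=True)
mln_optimizer = torch.optim.LBFGS([mln_weights], lr=0.01)
```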