FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection

Authors: Yuchen Shen, Haomin Wen, Leman Akoglu

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive, outperforming the majority of the baselines with no statistically significant difference from the 2nd-best method. Further, FoMo-0D is efficient in inference time, requiring only 7.7 ms per sample on average, with at least 7x speedup over previous methods.
Researcher Affiliation Academia Yuchen Shen (EMAIL), Carnegie Mellon University; Haomin Wen (EMAIL), Carnegie Mellon University; Leman Akoglu (EMAIL), Carnegie Mellon University
Pseudocode Yes Details are outlined in Algorithm 1 in Appendix C, and described as follows. At each time, we first draw a hypothesis (i.e., GMM configuration) uniformly at random, that is, ϕ = {d ∈ [D], m ∈ [M], {µ_j}_{j=1}^m ∈ [−5, 5]^d, {Σ_j}_{j=1}^m : diag(Σ_j) ∈ [−5, 5]^d}, and then generate a synthetic dataset D = {D^in, D^out} containing synthetic inlier and outlier samples from the drawn hypothesis and its variance-inflated variant, respectively. We optimize FoMo-0D's parameters θ to make predictions on D_test = {D^in_test, D^out_test}, conditioned on the inlier-only training data D_train ⊂ D^in, based on the cross-entropy loss (see Eq. (2)). During training, D_test contains a balanced number of inlier and outlier samples, where D^in_test = D^in \ D_train, and D^out_test ⊆ D^out contains an equal number of samples as D^in_test. To vary the training data size, we subsample D_train of randomly drawn size n ∈ [n_L, n_U], where n_L and n_U denote the lower and upper bounds. In our implementation, we use n_L = 500 and n_U = 5,000. FoMo-0D is trained on 200,000 batches (200 epochs × 1,000 steps/epoch) of B = 8 generated datasets in each batch. While this pre-training phase can be expensive, it is done only once, offline. Moreover, we introduce several scalability improvements to speed up pre-training, as discussed later in Section 3.3. Full details on the training and implementation of FoMo-0D are given in Appendix C.
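The data-synthesis step quoted above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the maximum dimension `D_MAX`, component count `M_MAX`, the positive variance range, and the variance-inflation factor are all assumptions here, since the quote only specifies the mean range [−5, 5] and the training-size bounds n_L = 500, n_U = 5,000.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constants standing in for the paper's bounds D and M.
D_MAX, M_MAX = 20, 5
N_L, N_U = 500, 5000     # training-set size bounds stated in the text
VAR_INFLATE = 4.0        # assumed inflation factor for the outlier variant

def draw_hypothesis():
    """Draw one random GMM configuration phi, as described in the text."""
    d = int(rng.integers(1, D_MAX + 1))        # feature dimension d in [D]
    m = int(rng.integers(1, M_MAX + 1))        # number of components m in [M]
    mus = rng.uniform(-5, 5, size=(m, d))      # component means in [-5, 5]^d
    var = rng.uniform(0.1, 5, size=(m, d))     # diagonal variances (range assumed)
    return d, m, mus, var

def sample_gmm(n, d, m, mus, var, inflate=1.0):
    """Sample n points from the (possibly variance-inflated) diagonal GMM."""
    comp = rng.integers(0, m, size=n)
    return mus[comp] + rng.normal(size=(n, d)) * np.sqrt(var[comp] * inflate)

# One synthetic dataset: inliers from phi, outliers from its inflated variant,
# with a balanced test set (equal numbers of inlier and outlier test samples).
d, m, mus, var = draw_hypothesis()
n_train = int(rng.integers(N_L, N_U + 1))
D_in = sample_gmm(n_train + 200, d, m, mus, var)
D_train, D_in_test = D_in[:n_train], D_in[n_train:]
D_out_test = sample_gmm(len(D_in_test), d, m, mus, var, inflate=VAR_INFLATE)
```

Each pre-training batch in the paper stacks B = 8 such generated datasets; this sketch draws just one.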
Open Source Code Yes To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://github.com/A-Chicharito-S/FoMo-0D.
Open Datasets Yes While pre-training is purely on synthetic datasets, we evaluate FoMo-0D on 57 real-world datasets from ADBench (Han et al., 2022) (see Table 20 in Appendix J).
Dataset Splits Yes Following Livernoche et al. (2024), we use 5 train/test splits of each dataset via different seeds and report mean performance and standard deviation. In particular, each random split designates 50% of the inliers as Dtrain, while Dtest contains the rest of the inliers and all the outlier samples.
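The quoted split protocol (50% of inliers as D_train; the remaining inliers plus all outliers as D_test, over 5 seeds) can be sketched as follows. The label convention (0 = inlier, 1 = outlier) and function name are assumptions for illustration.

```python
import numpy as np

def split_dataset(y, seed):
    """One split: 50% of inliers for training, rest + all outliers for test.

    y: array of labels, 0 = inlier, 1 = outlier (convention assumed here).
    Returns (train_idx, test_idx); the train set contains inliers only.
    """
    rng = np.random.default_rng(seed)
    inlier_idx = np.flatnonzero(y == 0)
    outlier_idx = np.flatnonzero(y == 1)
    perm = rng.permutation(inlier_idx)
    half = len(perm) // 2
    train_idx = perm[:half]                                # D_train
    test_idx = np.concatenate([perm[half:], outlier_idx])  # D_test
    return train_idx, test_idx

# Toy example: 100 inliers, 10 outliers; 5 splits via different seeds,
# after which mean performance and standard deviation would be reported.
y = np.array([0] * 100 + [1] * 10)
splits = [split_dataset(y, seed) for seed in range(5)]
```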
Hardware Specification Yes We base our experiments on an NVIDIA RTX A6000 GPU with AMD EPYC 7742 64-Core Processors.
Software Dependencies No We train our models for 200 epochs with the Adam optimizer (Kingma & Ba, 2017) and a learning_rate = 0.001, and test with the model corresponding to the lowest training loss.
Experiment Setup Yes We train our models for 200 epochs with the Adam optimizer (Kingma & Ba, 2017) and a learning_rate = 0.001, and test with the model corresponding to the lowest training loss. The size of our D = {20, 100} model is 4.87M and 4.89M parameters, respectively. ... Model architecture We use a 4-layer Transformer with hidden dimension h_dim = 256, a linear embedding layer at the input (ℝ^D → ℝ^{h_dim}), and a 2-layer MLP at the output (ℝ^{h_dim} → ℝ^2) for inlier vs. outlier binary classification. For each Transformer layer, we use num_head = 4 for each attention module and R = 500 for the router-based attention (Figure 2).
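The input/output shape contract described in this row can be traced with a shape-only sketch. This is not the paper's model: the 4 Transformer layers with router-based attention (R = 500) are omitted, and the MLP's hidden width and ReLU activation are assumptions; only the stated maps ℝ^D → ℝ^{h_dim} and ℝ^{h_dim} → ℝ^2 come from the text.

```python
import numpy as np

H_DIM, N_LAYERS, N_HEADS, N_CLASSES = 256, 4, 4, 2  # values from the text
D = 20  # feature dimension of the D = 20 model variant

rng = np.random.default_rng(0)

# Linear input embedding: R^D -> R^{h_dim}
W_emb = rng.normal(scale=0.02, size=(D, H_DIM))
b_emb = np.zeros(H_DIM)

# 2-layer output MLP: R^{h_dim} -> R^2 (hidden width H_DIM is an assumption)
W1, b1 = rng.normal(scale=0.02, size=(H_DIM, H_DIM)), np.zeros(H_DIM)
W2, b2 = rng.normal(scale=0.02, size=(H_DIM, N_CLASSES)), np.zeros(N_CLASSES)

def forward_shapes(x):
    """Shape walkthrough only; the 4 Transformer layers with num_head = 4
    and router-based attention sit between these two maps and are omitted."""
    h = x @ W_emb + b_emb           # (n, D) -> (n, 256)
    # ... Transformer layers would transform h here ...
    h = np.maximum(h @ W1 + b1, 0)  # MLP hidden layer with ReLU (assumed)
    return h @ W2 + b2              # (n, 256) -> (n, 2) inlier/outlier logits

logits = forward_shapes(np.zeros((8, D)))
```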