FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection
Authors: Yuchen Shen, Haomin Wen, Leman Akoglu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive, outperforming the majority of the baselines with no statistically significant difference from the 2nd-best method. Further, FoMo-0D is efficient at inference, requiring only 7.7 ms per sample on average, with at least a 7× speedup over previous methods. |
| Researcher Affiliation | Academia | Yuchen Shen (Carnegie Mellon University), Haomin Wen (Carnegie Mellon University), Leman Akoglu (Carnegie Mellon University) |
| Pseudocode | Yes | Details are outlined in Algorithm 1 in Appendix C, and described as follows. At each time, we first draw a hypothesis (i.e. GMM configuration) uniformly at random, that is, ϕ = {d ∈ [D], m ∈ [M], {µ_j}_{j=1}^m ⊂ [−5, 5]^d, {Σ_j}_{j=1}^m with diag(Σ_j) ∈ [−5, 5]^d}, and then generate a synthetic dataset D = {D^in, D^out} containing synthetic inlier and outlier samples from the drawn hypothesis and its variance-inflated variant, respectively. We optimize FoMo-0D's parameters θ to make predictions on D_test = {D^in_test, D^out_test}, conditioned on the inlier-only training data D_train ⊂ D^in, based on the cross-entropy loss (see Eq. (2)). During training, D_test contains a balanced number of inlier and outlier samples, where D^in_test = D^in \ D_train, and D^out_test ⊂ D^out contains an equal number of samples as D^in_test. To vary the training data size, we subsample D_train of randomly drawn size n ∈ [n_L, n_U], where n_L and n_U denote the lower and upper bounds. In our implementation, we use n_L = 500 and n_U = 5,000. FoMo-0D is trained on 200,000 batches (200 epochs × 1,000 steps/epoch) of B = 8 generated datasets in each batch. While this pre-training phase can be expensive, it is done only once, offline. Moreover, we introduce several scalability improvements to speed up pre-training, as discussed later in Section 3.3. Full details on the training and implementation of FoMo-0D are given in Appendix C. |
| Open Source Code | Yes | To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://github.com/A-Chicharito-S/FoMo-0D. |
| Open Datasets | Yes | While pre-training is purely on synthetic datasets, we evaluate FoMo-0D on 57 real-world datasets from ADBench (Han et al., 2022) (see Table 20 in Appendix J). |
| Dataset Splits | Yes | Following Livernoche et al. (2024), we use 5 train/test splits of each dataset via different seeds and report mean performance and standard deviation. In particular, each random split designates 50% of the inliers as Dtrain, while Dtest contains the rest of the inliers and all the outlier samples. |
| Hardware Specification | Yes | We base our experiments on an NVIDIA RTX A6000 GPU with AMD EPYC 7742 64-Core Processors. |
| Software Dependencies | No | We train our models for 200 epochs with the Adam optimizer (Kingma & Ba, 2017) and a learning_rate = 0.001, and test with the model corresponding to the lowest training loss. |
| Experiment Setup | Yes | We train our models for 200 epochs with the Adam optimizer (Kingma & Ba, 2017) and learning_rate = 0.001, and test with the model corresponding to the lowest training loss. The sizes of our D = 20 and D = 100 models are 4.87M and 4.89M parameters, respectively. ... Model architecture: We use a 4-layer Transformer with hidden dimension h_dim = 256, a linear embedding layer at the input (R^D → R^{h_dim}), and a 2-layer MLP at the output (R^{h_dim} → R^2) for inlier vs. outlier binary classification. For each Transformer layer, we use num_head = 4 for each attention module and R = 500 for the router-based attention (Figure 2). |
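The data-synthesis step quoted in the Pseudocode row (draw a GMM hypothesis uniformly at random, sample inliers from it and outliers from a variance-inflated variant) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' implementation: the function names, the choice M = 5, the positive-variance range, and the inflation factor 9 are assumptions.

```python
import random
import math

def draw_hypothesis(D=20, M=5, scale=5.0):
    """Draw a GMM configuration uniformly at random: dimensionality d in [D],
    number of components m in [M], component means in [-scale, scale]^d.
    Variances are kept positive here (an assumption; the paper states the
    diagonal-covariance range alongside the means)."""
    d = random.randint(1, D)
    m = random.randint(1, M)
    means = [[random.uniform(-scale, scale) for _ in range(d)] for _ in range(m)]
    variances = [[random.uniform(0.01, scale) for _ in range(d)] for _ in range(m)]
    return d, m, means, variances

def sample_gmm(n, d, m, means, variances, inflate=1.0):
    """Sample n points from the GMM; inflate > 1 widens every component's
    variance, mimicking the variance-inflated variant used for outliers."""
    points = []
    for _ in range(n):
        j = random.randrange(m)  # pick a mixture component uniformly
        points.append([random.gauss(means[j][k], math.sqrt(inflate * variances[j][k]))
                       for k in range(d)])
    return points

random.seed(0)
d, m, mu, var = draw_hypothesis()
D_in = sample_gmm(100, d, m, mu, var)             # synthetic inliers
D_out = sample_gmm(20, d, m, mu, var, inflate=9)  # synthetic outliers (hypothetical factor)
```

In the paper's setup, many such (inlier, outlier) dataset pairs are generated per training batch, so the sampler above would sit inside the pre-training loop.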
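The evaluation protocol in the Dataset Splits row (per seed, 50% of inliers form D_train; D_test is the remaining inliers plus all outliers; 5 seeded splits) can be sketched as follows. Names and the toy data are illustrative, not the ADBench loader.

```python
import random

def split_dataset(inliers, outliers, seed):
    """One seeded split: half the inliers (at random) become the training
    set; the rest of the inliers plus all outliers form the test set."""
    rng = random.Random(seed)
    idx = list(range(len(inliers)))
    rng.shuffle(idx)
    half = len(inliers) // 2
    d_train = [inliers[i] for i in idx[:half]]
    d_test = [inliers[i] for i in idx[half:]] + list(outliers)
    return d_train, d_test

# five seeded splits, mirroring the 5-run protocol used for mean/std reporting
toy_inliers = list(range(10))
toy_outliers = ["o1", "o2"]
splits = [split_dataset(toy_inliers, toy_outliers, seed=s) for s in range(5)]
```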