AnoLLM: Large Language Models for Tabular Anomaly Detection
Authors: Che-Ping Tsai, Ganyu Teng, Phillip Wallis, Wei Ding
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results indicate that AnoLLM delivers the best performance on six benchmark datasets with mixed feature types. Additionally, across 30 datasets from the ODDS library, which are predominantly numerical, AnoLLM performs on par with top-performing baselines. |
| Researcher Affiliation | Industry | Che-Ping Tsai, Ganyu Teng, Phil Wallis, Wei Ding (Amazon) |
| Pseudocode | No | The paper describes the methods through textual explanations and mathematical equations (e.g., Eqn. 1, Eqn. 5, Eqn. 6) and process descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper contains no unambiguous statement releasing its own source code, nor does it provide a direct link to a code repository for the AnoLLM framework. It mentions using the PyOD and DeepOD libraries for baselines, but not for the proposed method. |
| Open Datasets | Yes | Datasets: Since popular anomaly detection benchmarks, such as ADBench (Han et al., 2022) and the ODDS library (Rayana, 2016), mainly consist of numerical features, we manually collect six datasets that contain mixed types of features. The six datasets are derived from the ODDS library (Rayana, 2016), the fraud dataset benchmarks (Grover et al., 2022), and Kaggle. The dataset statistics are described in Table 1. To demonstrate the ability of AnoLLM to accommodate numerical columns, we also evaluate the approach on 30 datasets from the ODDS library, which are mainly composed of numerical features. The ODDS library is collected from various domains, such as chemistry, healthcare, and astronautics. |
| Dataset Splits | Yes | Evaluation protocols: Following prior works (Shenkar & Wolf, 2022; Xu et al., 2023b), we conduct experiments in an uncontaminated, unsupervised setting. The training set consists of a random sample of 50% from the pool of normal examples, with the test set comprising the remaining normal examples, along with all anomalies. We randomly split each dataset using 5 different random seeds and reported the averaged results. |
| Hardware Specification | Yes | Finetuning and inference are performed on seven Nvidia A100 40GB GPUs hosted on Amazon EC2 P4 Instances. [...] The total compute required to train AnoLLM-135M across all datasets with five seeds, including six datasets from the mixed-type benchmark and 30 datasets from the ODDS benchmark, is approximately 90 GPU hours on a single RTX-A6000 GPU with 48 GB of memory. |
| Software Dependencies | No | The paper mentions using an AdamW optimizer (Loshchilov & Hutter, 2019), the PyOD library (Zhao et al., 2019) and DeepOD library (Xu et al., 2023a) for baselines, and a LoRA adapter (Hu et al., 2022). However, specific version numbers for these software components or other key libraries (like PyTorch, TensorFlow, or Python) are not provided. |
| Experiment Setup | Yes | Fine-tuning is conducted for 2,000 steps with an AdamW optimizer (Loshchilov & Hutter, 2019) with learning rate 5 × 10⁻⁵ across all datasets as the training loss converges uniformly. Batch sizes are adjusted for each dataset to accommodate the varying lengths of serialized data. During inference, we select the number of permutations r = 21 since further increasing r does not result in any observed improvement. [...] Detailed hyperparameters are shown in Table 7 of the Appendix. |
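The evaluation protocol quoted under Dataset Splits (train on a random 50% of the normal examples, test on the remaining normals plus all anomalies, averaged over 5 seeds) can be sketched as follows. This is a minimal sketch, not the authors' code; the function name and the convention that label 0 marks normal rows are assumptions.

```python
import numpy as np

def split_uncontaminated(X, y, seed, train_frac=0.5):
    """Uncontaminated unsupervised split: train on a random fraction of
    the normal examples; test on the remaining normals plus all anomalies.
    Assumes y == 0 marks normal rows and y == 1 marks anomalies."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == 0)
    anomaly_idx = np.flatnonzero(y == 1)
    normal_idx = rng.permutation(normal_idx)
    n_train = int(len(normal_idx) * train_frac)
    train_idx = normal_idx[:n_train]
    # Test set: remaining normals first, then every anomaly.
    test_idx = np.concatenate([normal_idx[n_train:], anomaly_idx])
    return X[train_idx], X[test_idx], y[test_idx]

# Per the paper's protocol, results are averaged over 5 random seeds, e.g.:
# results = [evaluate(*split_uncontaminated(X, y, seed)) for seed in range(5)]
```

Note that the training set contains no anomalies by construction, which is what makes the setting "uncontaminated".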
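The inference step quoted under Experiment Setup averages scores over r = 21 random feature permutations. A minimal sketch of that idea, assuming a hypothetical `nll_fn` that returns the language model's negative log-likelihood for a serialized row, and a simple "column is value" serialization (both are assumptions; the paper's exact serialization and scoring may differ):

```python
import numpy as np

def anomaly_score(nll_fn, row, r=21, seed=0):
    """Average a scoring function over r random column-order
    permutations of the serialized row (hypothetical nll_fn)."""
    rng = np.random.default_rng(seed)
    cols = list(row.keys())
    scores = []
    for _ in range(r):
        perm = rng.permutation(cols)
        text = ", ".join(f"{c} is {row[c]}" for c in perm)
        scores.append(nll_fn(text))
    return float(np.mean(scores))
```

Averaging over permutations removes the dependence of an autoregressive model's likelihood on any one fixed column order; per the quoted setup, increasing r beyond 21 gave no observed improvement.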