TabWak: A Watermark for Tabular Diffusion Models
Authors: Chaoyi Zhu, Jiayi Tang, Jeroen Galjaard, Pin-Yu Chen, Robert Birke, Cornelis Bos, Lydia Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TabWak on five datasets against baselines to show that the quality of watermarked tables remains nearly indistinguishable from non-watermarked tables while achieving high detectability in the presence of strong post-editing attacks, with a 100% true positive rate at a 0.1% false positive rate on synthetic tables with fewer than 300 rows. |
| Researcher Affiliation | Collaboration | Chaoyi Zhu (TU Delft), Jiayi Tang (TU Delft), Jeroen Galjaard (TU Delft), Pin-Yu Chen (IBM Research), Robert Birke (University of Turin), Cornelis Bos (TU Delft, Tata Steel Research), Lydia Y. Chen (TU Delft, University of Neuchâtel) |
| Pseudocode | Yes | Algorithm 1 Tree-Ring Embedding in Tabular Data |
| Open Source Code | Yes | Our code is available at the following repository: https://github.com/chaoyitud/TabWak. |
| Open Datasets | Yes | We used five widely utilized tabular datasets to evaluate the performance of the proposed TabWak on synthetic data quality, the effectiveness of its watermark detection, and its robustness against post-editing attacks. These include: Shoppers (Sakar & Kastro, 2018), Magic (Bock, 2007), Credit (Yeh, 2016), Adult (Becker & Kohavi, 1996), and Diabetes (Strack et al., 2014). |
| Dataset Splits | No | The paper mentions splitting data for discriminability evaluation: "First, all rows from both the real and synthetic datasets are combined and then split into training and validation sets. The machine learning model is trained on the training set and evaluated on the validation sets." However, it does not specify explicit percentages or exact counts for these splits or for the training-on-synthetic, test-on-real setting, making reproducibility of the exact splits difficult. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU specifications, or memory, to run the experiments. It only mentions general computing environments like 'HPC' in the acknowledgements. |
| Software Dependencies | No | The paper mentions using a 'Tabsyn framework' and 'Denoising Diffusion Probabilistic Model (DDPM)' and 'Denoising Diffusion Implicit Models (DDIM)' as methodologies. It also references 'Transformer architecture' and 'multi-layer perceptron (MLP)'. However, it does not specify particular software packages or libraries with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that were used in the implementation. |
| Experiment Setup | Yes | In more detail, for the model we used, the autoencoder module comprises an encoder and a decoder, each following a 2-layer Transformer architecture. The hidden dimension of the Transformer's feed-forward network (FFN) is set to 128. The diffusion model comprises a 4-layer multi-layer perceptron (MLP) with a hidden dimension of 1024. For both the diffusion and sampling processes within the diffusion model, 1000 timesteps are used. With these hyperparameters, the latent tabular model consistently generates high-quality synthetic data in the absence of watermarking, achieving similarity metrics above 0.88, discriminability metrics above 0.63, and utility metrics around 0.79 across all datasets. Therefore, the same architecture is employed for all datasets, while the number of training epochs is tuned for each dataset individually. |
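The hyperparameters reported under "Experiment Setup" can be collected into a configuration sketch. This is a minimal illustration for reproduction purposes only; the class and field names below are hypothetical and are not taken from the TabWak codebase.

```python
from dataclasses import dataclass


@dataclass
class LatentTabularConfig:
    """Hypothetical config mirroring the paper's reported hyperparameters."""
    # Autoencoder: encoder and decoder are each 2-layer Transformers,
    # with an FFN hidden dimension of 128.
    transformer_layers: int = 2
    ffn_hidden_dim: int = 128
    # Diffusion model: a 4-layer MLP with hidden dimension 1024.
    mlp_layers: int = 4
    mlp_hidden_dim: int = 1024
    # 1000 timesteps for both the diffusion and sampling processes.
    timesteps: int = 1000


cfg = LatentTabularConfig()
print(cfg.ffn_hidden_dim, cfg.mlp_hidden_dim, cfg.timesteps)
```

Note that per the paper, this single architecture is shared across datasets; only the number of training epochs is tuned per dataset, so a reproduction would vary that one value while keeping the fields above fixed.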