Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
Authors: Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries. ... We conduct experiments in which we can provide answers to the following three experimental questions: Q1. Does MaCoDE achieve state-of-the-art performance in synthetic data generation? Q2. Can MaCoDE generate high-quality synthetic data even when faced with missing data scenarios? Q3. Is MaCoDE capable of supporting multiple imputations for deriving statistically valid inferences from missing data? |
| Researcher Affiliation | Academia | 1Department of Statistical Data Science, University of Seoul, S. Korea 2Department of Statistics, University of Seoul, S. Korea 3Department of Statistics, Changwon National University, S. Korea EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Synthetic data generation. Initialize: for all j, ŷ_j ← 0 and m̂_j ← 0. Output: a synthetic sample x̂. 1: for j ∈ randperm{1, 2, ..., p} do 2: ŷ_j ∼ Cat(π_j(ŷ ⊙ m̂; θ)/τ) 3: m̂_j ← 1 4: for j = 1, 2, ..., p do 5: if j ∈ I_C then 6: u ∼ U(b_{ŷ_j−1}, b_{ŷ_j}) 7: x̂_j ← F̂_j⁻¹(u) 8: if j ∈ I_D then 9: x̂_j ← ŷ_j |
| Open Source Code | Yes | Code https://github.com/an-seunghwan/MaCoDE |
| Open Datasets | Yes | we utilize 10 publicly available real tabular UCI and Kaggle1 datasets of varying sizes and the number of columns. Detailed statistics of these datasets are provided in the Appendix. ... 1https://archive.ics.uci.edu/, https://www.kaggle.com/datasets/ |
| Dataset Splits | Yes | For each random seed, the dataset is randomly split into training and testing sets with an 80% training and 20% testing ratio in the evaluation of Q1 and Q2. |
| Hardware Specification | Yes | We run experiments using NVIDIA A10 GPU. |
| Software Dependencies | No | The paper mentions various baseline models (CTGAN, TVAE, TabDDPM, etc.) and refers to 'Detailed hyperparameter settings are provided in the Appendix.' However, it does not provide specific version numbers for software libraries (e.g., Python, PyTorch) or other dependencies used for implementing and running the proposed method. |
| Experiment Setup | Yes | For MaCoDE, we set L = 50 and τ = 1 for all datasets. Detailed hyperparameter settings are provided in the Appendix. |
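The generation procedure quoted in the Pseudocode row (start fully masked, reveal columns one at a time, then decode categories back to raw values) can be sketched in Python. This is a minimal sketch, not the authors' implementation: `predict_probs`, the bin edges `bins`, and the `inv_cdfs` estimators are hypothetical stand-ins for the trained masked language model and the per-column quantile transforms, and categories are assumed to be 1..L with 0 reserved for the mask token.

```python
import numpy as np

def generate_sample(predict_probs, bins, inv_cdfs, cont_cols, disc_cols,
                    tau=1.0, rng=None):
    """Sketch of Algorithm 1: draw one synthetic row by iterative unmasking.

    predict_probs(y_hat, m_hat) -> list of per-column probability vectors
        pi_j over L categories (hypothetical stand-in for the trained MLM).
    bins[j]     -> L + 1 bin edges for continuous column j.
    inv_cdfs[j] -> estimated inverse CDF for continuous column j.
    Categories are 1..L; 0 is the mask token.
    """
    rng = rng or np.random.default_rng()
    p = len(cont_cols) + len(disc_cols)
    y_hat = np.zeros(p, dtype=int)   # all columns start at the mask token 0
    m_hat = np.zeros(p, dtype=int)   # mask indicators: 0 = still masked

    # Phase 1: visit columns in a random order, sampling each category
    # from the temperature-scaled conditional Cat(pi_j / tau).
    for j in rng.permutation(p):
        pi_j = np.asarray(predict_probs(y_hat, m_hat)[j], dtype=float)
        logits = np.log(pi_j + 1e-12) / tau
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        y_hat[j] = rng.choice(len(probs), p=probs) + 1  # categories 1..L
        m_hat[j] = 1                                    # column j revealed

    # Phase 2: decode sampled categories back to raw values.
    x_hat = np.empty(p)
    for j in range(p):
        if j in cont_cols:
            # uniform draw inside the selected bin, then inverse-CDF map
            u = rng.uniform(bins[j][y_hat[j] - 1], bins[j][y_hat[j]])
            x_hat[j] = inv_cdfs[j](u)
        else:
            x_hat[j] = y_hat[j]      # discrete: the category is the value
    return x_hat
```

Note how the temperature `tau` enters only at sampling time: raising it flattens the conditionals, which matches the paper's claim that privacy levels can be adjusted without retraining.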