Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis

Authors: Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries. ... We conduct experiments in which we can provide answers to the following three experimental questions: Q1. Does MaCoDE achieve state-of-the-art performance in synthetic data generation? Q2. Can MaCoDE generate high-quality synthetic data even when faced with missing data scenarios? Q3. Is MaCoDE capable of supporting multiple imputations for deriving statistically valid inferences from missing data?
Researcher Affiliation Academia (1) Department of Statistical Data Science, University of Seoul, S. Korea; (2) Department of Statistics, University of Seoul, S. Korea; (3) Department of Statistics, Changwon National University, S. Korea. EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: Synthetic data generation
Initialize: For all j, ŷ_j ← 0 and m̂_j ← 0.
Output: A synthetic sample x̂.
1: for j ∈ randperm{1, 2, ..., p} do
2:   ŷ_j ∼ Cat(π_j(ŷ, m̂; θ)/τ)
3:   m̂_j ← 1
4: for j = 1, 2, ..., p do
5:   if j ∈ I_C then
6:     u ∼ U(b_{ŷ_j − 1}, b_{ŷ_j})
7:     x̂_j ← F̂_j⁻¹(u)
8:   if j ∈ I_D then
9:     x̂_j ← ŷ_j
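The two-stage procedure in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_logits`, `inv_cdfs`, and `bin_edges` are hypothetical stand-ins for the trained MLM classifier π_j, the estimated per-column marginal CDF inverses F̂_j⁻¹, and the bin boundaries b, and τ is assumed to enter as a softmax temperature.

```python
import numpy as np

def generate_sample(predict_logits, inv_cdfs, bin_edges, cont_cols, p,
                    tau=1.0, rng=None):
    """Sketch of Algorithm 1. `predict_logits(y, m, j)` stands in for the
    trained masked-language-model classifier producing logits for column j
    given the partially unmasked labels y and mask indicators m."""
    rng = np.random.default_rng() if rng is None else rng
    y_hat = np.zeros(p, dtype=int)  # class labels; 0 means "still masked"
    m_hat = np.zeros(p, dtype=int)  # mask indicators
    # Stage 1 (lines 1-3): unmask the p columns one by one in random order.
    for j in rng.permutation(p):
        z = predict_logits(y_hat, m_hat, j) / tau  # temperature-scaled logits
        probs = np.exp(z - z.max())                # stable softmax
        probs /= probs.sum()
        y_hat[j] = rng.choice(len(probs), p=probs) + 1  # labels run 1..K
        m_hat[j] = 1
    # Stage 2 (lines 4-9): map class labels back to the data scale.
    x_hat = np.empty(p)
    for j in range(p):
        if j in cont_cols:
            # Continuous column: draw u uniformly inside the sampled bin,
            # then invert the estimated marginal CDF of column j.
            lo, hi = bin_edges[j][y_hat[j] - 1], bin_edges[j][y_hat[j]]
            u = rng.uniform(lo, hi)
            x_hat[j] = inv_cdfs[j](u)
        else:
            # Discrete column: the sampled label is the value itself.
            x_hat[j] = y_hat[j]
    return x_hat
```

Note the order of the two loops: all columns are first unmasked autoregressively in a random permutation, and only afterwards are the categorical bin labels decoded back to continuous values.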
Open Source Code Yes Code https://github.com/an-seunghwan/MaCoDE
Open Datasets Yes we utilize 10 publicly available real tabular UCI and Kaggle(1) datasets of varying sizes and numbers of columns. Detailed statistics of these datasets are provided in the Appendix. ... (1) https://archive.ics.uci.edu/, https://www.kaggle.com/datasets/
Dataset Splits Yes For each random seed, the dataset is randomly split into training and testing sets with an 80% training and 20% testing ratio in the evaluation of Q1 and Q2.
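The per-seed 80/20 split described above can be reproduced with a few lines of NumPy. This is a sketch of the stated protocol, not the authors' code; the function name and signature are hypothetical.

```python
import numpy as np

def split_80_20(n_rows, seed):
    """Randomly partition row indices into 80% train / 20% test,
    deterministically per random seed (sketch of the stated protocol)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(0.8 * n_rows)
    return idx[:cut], idx[cut:]
```

Seeding the generator per split makes each of the repeated evaluation runs reproducible while still varying the partition across seeds.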
Hardware Specification Yes We run experiments using an NVIDIA A10 GPU.
Software Dependencies No The paper mentions various baseline models (CTGAN, TVAE, TabDDPM, etc.) and refers to 'Detailed hyperparameter settings are provided in the Appendix.' However, it does not provide specific version numbers for software libraries (e.g., Python, PyTorch) or other dependencies used to implement and run the proposed method.
Experiment Setup Yes For MaCoDE, we set L = 50 and τ = 1 for all datasets. Detailed hyperparameter settings are provided in the Appendix.
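The temperature τ that appears in Algorithm 1 (line 2, Cat(π/τ)) is the knob the abstract credits with adjusting privacy levels without retraining. Assuming τ acts as a standard softmax temperature, a raised τ flattens the per-column categorical distribution, trading fidelity for more diffuse (more private) samples. A minimal sketch of that reading:

```python
import numpy as np

def temperature_probs(logits, tau):
    """Softmax with temperature tau (sketch, assuming the standard reading
    of Cat(pi/tau) in Algorithm 1). Larger tau flattens the distribution;
    tau = 1 recovers the plain softmax used in the paper's default setup."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Because τ only rescales the classifier's output at sampling time, a single trained model can emit samples at different privacy/fidelity trade-offs without retraining.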