Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion
Authors: Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The extensive experiments reveal that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. |
| Researcher Affiliation | Academia | (1) School of Artificial Intelligence, Hebei University of Technology, Tianjin, China; (2) School of Software, Tiangong University, Tianjin, China; (3) University of Southern Queensland, Queensland, Australia |
| Pseudocode | Yes | Algorithm 1: Training of CoFILL; Algorithm 2: Imputation of CoFILL |
| Open Source Code | Yes | The source code is publicly available at https://github.com/joyHJL/CoFILL. |
| Open Datasets | Yes | We use three spatiotemporal datasets for imputation algorithms. The AQI-36 dataset [Zheng et al., 2014] serves as a good benchmark due to its inherent missing data patterns and complex spatiotemporal dependencies. The traffic datasets, METR-LA and PEMS-BAY [Li et al., 2017], present different scales to test our model's capabilities. |
| Dataset Splits | Yes | The AQI-36 dataset requires careful temporal consideration due to its seasonal nature. We distribute the data across seasons by selecting March, June, September, and December for testing, which captures seasonal variations throughout the year. The validation set draws from the final 10% of data in February, May, August, and November to maintain seasonal representation. The remaining months form the training set. To evaluate model performance under different missing data scenarios, we design two types of data corruption schemes. For AQI-36, we implement a simulated failure (SF) pattern that replicates real-world sensor malfunction distributions. For PEMS-BAY and METR-LA traffic datasets, we create controlled missing data scenarios through mask matrices. These scenarios include random point missing (Point), where we mask 25% of observations uniformly at random, and structured block missing (Block), which combines 5% random masking with continuous missing segments. These segments span 1 to 4 hours per sensor and occur with 0.15% probability, simulating extended sensor outages. |
| Hardware Specification | Yes | Our implementation runs on an NVIDIA RTX 4090 GPU with 24GB VRAM. |
| Software Dependencies | No | The paper mentions "Adam" for optimization but does not provide specific software library versions (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We optimize the model using Adam with cosine annealing learning rate decay. The learning rate starts at $10^{-3}$ and decays to $10^{-5}$. Table 1 details the complete hyperparameter configuration for each dataset. |
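
The masking schemes and learning-rate schedule quoted in the table can be illustrated with a minimal, standard-library Python sketch. It is not taken from the CoFILL repository. All function names and the `total_steps` and `seed` parameters are hypothetical. The sketch also assumes that the 0.15% figure is the per-sensor, per-time-step chance of an outage starting, and that the traffic data is sampled every 5 minutes, so that 1 to 4 hours corresponds to 12 to 48 steps.

```python
import math
import random


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def point_missing_mask(n_steps, n_sensors, p=0.25, seed=0):
    """Point scenario: hide each observation independently with probability p.

    mask[t][s] is True where the value at time t, sensor s is hidden.
    """
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(n_sensors)] for _ in range(n_steps)]


def block_missing_mask(n_steps, n_sensors, p_point=0.05, p_block=0.0015,
                       min_len=12, max_len=48, seed=0):
    """Block scenario: 5% random masking plus contiguous per-sensor outages.

    Assumption: p_block is the chance that an outage starts at a given
    (time step, sensor) pair; min_len/max_len assume 5-minute sampling.
    """
    rng = random.Random(seed)
    mask = [[rng.random() < p_point for _ in range(n_sensors)]
            for _ in range(n_steps)]
    for s in range(n_sensors):
        for t in range(n_steps):
            if rng.random() < p_block:
                length = rng.randint(min_len, max_len)  # inclusive bounds
                for u in range(t, min(t + length, n_steps)):
                    mask[u][s] = True
    return mask


def masked_fraction(mask):
    """Share of hidden entries in a mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

Calling `masked_fraction(point_missing_mask(2000, 10))` should return a value close to 0.25, while the block mask hides more than its 5% random component because outages add contiguous runs on top of it. The paper's own mask construction may differ in detail, so the released code remains the reference.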