Teaching Diffusion Models to Ground Alpha Matte
Authors: Tianyi Xiang, Weiying Zheng, Yutao Jiang, Tingrui Shen, Hewei Yu, Yangyang Xu, Shengfeng He
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments highlight our model's adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions. The code is available at https://github.com/xty435768/Teach-Diffusion-Matting. Section 4 (Experiment) comprises: 4.1 Implementation Details; 4.2 Evaluate Metrics; 4.3 Comparison on Soft Grounding; 4.4 Comparison on Generalization Ability; 4.5 Ablation Studies. Quantitative Results: We show the quantitative comparison on soft grounding in Tab. 1. |
| Researcher Affiliation | Academia | Tianyi Xiang, Department of Computer Science, City University of Hong Kong; Weiying Zheng, School of Computing and Data Science, The University of Hong Kong; Yutao Jiang, School of Computer Science and Engineering, South China University of Technology; Tingrui Shen, School of Computer Science and Engineering, South China University of Technology; Hewei Yu, School of Computer Science and Engineering, South China University of Technology; Yangyang Xu, School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen); Shengfeng He, School of Computing and Information Systems, Singapore Management University |
| Pseudocode | No | The paper describes its methodology in Section 3, titled "Method". This section includes mathematical formulations and textual descriptions of the model's pipeline, objectives, and structural optimizations, as well as an overview diagram (Fig. 3). However, it does not contain any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Extensive experiments highlight our model's adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions. The code is available at https://github.com/xty435768/Teach-Diffusion-Matting. |
| Open Datasets | Yes | The data used to train our model comprises 4 matting datasets (RefMatte (Li et al., 2023a), P3M10K (Li et al., 2021a), AM2K (Li et al., 2022), RM1K (Wang et al., 2023b)) and 1 grounding segmentation dataset (RefCOCO (Kazemzadeh et al., 2014)). |
| Dataset Splits | Yes | We apply two referring natural matting benchmarks, RefMatte-Test (Li et al., 2023a) and RefMatte-RW100 (Li et al., 2023a), for soft grounding evaluation. The former is a composition dataset (6,243 instances among 2,500 images) and the latter a real-world dataset (221 instances among 100 images). Every instance in these two benchmarks has 4 different expressions, so we evaluate all baselines and ours using all expressions and report the average result over the 4 expressions. During evaluation, the input resolution for all methods is set to 512×512, and the metrics are calculated at this resolution. All stages of our model's training process adopt a consistent data scheduling strategy: we train the model on RefMatte during odd-numbered iterations and on RefCOCO during even-numbered iterations, and we insert a special iteration after every 4 iterations to perform training on P3M10K, AM2K, and RM1K. |
| Hardware Specification | Yes | We also report the average inference time per sample in milliseconds, using the same machine with a single RTX 3090. All the training work is done on NVIDIA A100 80GB GPU(s). |
| Software Dependencies | No | The paper mentions several tools and models used, such as BLIP2 (Li et al., 2023b), AdamW (Loshchilov & Hutter, 2019), and Spconv (Contributors, 2022). However, it does not specify explicit version numbers for general software dependencies like Python, PyTorch, or CUDA, which are crucial for full reproducibility. |
| Experiment Setup | Yes | We set the kernel size of the morphological operation to 15, and we set (λ_STM, λ_CTM, λ_SG, λ_αlr, λ_R) to (10, 0.1, 0.5, 10, 1). The timestep input of the SD model is set to 1.0 during both training and inference, consistent with previous works (Zhao et al., 2023a; Lee et al., 2024; Xu et al., 2024a). Other training settings, including batch size, learning rate, total iterations, and the rationale behind the λ settings, can be found in the Appendix (Table 4: Hyperparameters for all training stages). |
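The data-scheduling strategy quoted under Dataset Splits (RefMatte on odd iterations, RefCOCO on even iterations, plus an extra iteration on P3M10K/AM2K/RM1K after every 4 iterations) can be sketched as a simple schedule builder. This is a hypothetical reading of the paper's description, not code from the authors' repository; the function name and dataset labels are illustrative.

```python
def build_schedule(total_main_iters: int) -> list[str]:
    """Sketch of the alternating training schedule described in the paper.

    Main iterations alternate RefMatte (odd) / RefCOCO (even); after every
    4 main iterations, one special iteration on the auxiliary matting
    datasets (P3M10K, AM2K, RM1K) is inserted. One possible interpretation.
    """
    plan = []
    for i in range(1, total_main_iters + 1):
        # Odd-numbered main iterations use RefMatte, even-numbered use RefCOCO.
        plan.append("RefMatte" if i % 2 == 1 else "RefCOCO")
        # "Insert a special iteration after every 4 iterations."
        if i % 4 == 0:
            plan.append("P3M10K/AM2K/RM1K")
    return plan
```

For example, the first block of the schedule would read RefMatte, RefCOCO, RefMatte, RefCOCO, then the auxiliary-dataset iteration, and the pattern repeats.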