DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation
Authors: Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsic and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image. |
| Researcher Affiliation | Academia | ¹College of Computer Science and Technology, Zhejiang University of Technology; ²Zhejiang Key Laboratory of Visual Information Intelligent Processing; ³State Key Lab of CAD & CG, Zhejiang University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methodology in text and illustrates the pipeline with figures. |
| Open Source Code | No | The paper does not include an unambiguous statement where the authors state they are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Training Datasets We choose Hypersim (Roberts et al. 2021) as our primary training dataset for incident map and depth map generation. This dataset comprises 461 synthetic indoor scenes with depth information and consistent ground-truth camera intrinsic parameters of [889, 889, 512, 384] across all scenes. We use 365 scenes for training, following the recommended setup. To increase the variety of training scenarios, we incorporate additional datasets: NuScenes (Caesar et al. 2020), KITTI (Geiger et al. 2013), Cityscapes (Cordts et al. 2016), NYUv2 (Silberman et al. 2012), SUN3D (Xiao, Owens, and Torralba 2013), ARKitScenes (Baruch et al. 2021), Objectron (Ahmadyan et al. 2021), and MVImgNet (Yu et al. 2023). These datasets are used with their camera intrinsic parameters but without depth information. For consistency, we replace the depth input with a copied image input to match the network input. To introduce variations in intrinsic parameters, we augment the intrinsic settings by randomly enlarging images up to twice their size and then cropping them to a suitable size, following the approach in (Lee et al. 2021). This augmentation addresses the scarcity of intrinsic variations within the dataset, ensuring a robust training process. Testing Datasets For monocular camera calibration, our evaluation encompasses datasets such as Waymo (Sun et al. 2020), RGBD (Sturm et al. 2012), ScanNet (Dai et al. 2017), MVS (Fuhrmann, Langguth, and Goesele 2014), and Scenes11 (Chang et al. 2015). We ensure alignment with the benchmark provided by WildCamera (Zhu et al. 2023) for this task. |
| Dataset Splits | Yes | Training Datasets We choose Hypersim (Roberts et al. 2021) as our primary training dataset for incident map and depth map generation. This dataset comprises 461 synthetic indoor scenes with depth information and consistent ground-truth camera intrinsic parameters of [889, 889, 512, 384] across all scenes. We use 365 scenes for training, following the recommended setup. [...] For monocular camera calibration, our evaluation encompasses datasets such as Waymo (Sun et al. 2020), RGBD (Sturm et al. 2012), ScanNet (Dai et al. 2017), MVS (Fuhrmann, Langguth, and Goesele 2014), and Scenes11 (Chang et al. 2015). We ensure alignment with the benchmark provided by WildCamera (Zhu et al. 2023) for this task. [...] In Table 1, for a fair comparison, we utilize the same data as WildCamera (Zhu et al. 2023) to train our method specifically for the incident map and evaluate the metrics on the test split of the seen dataset. |
| Hardware Specification | Yes | Typically, achieving convergence during our training process necessitates approximately 12 hours when executed on a single Nvidia RTX A800 GPU card. |
| Software Dependencies | Yes | We leverage the pre-trained model provided by Stable Diffusion v2.1 (Rombach et al. 2022), wherein we freeze the VAE encoder and decoder, focusing solely on training the U-Net. This training regimen adheres to the original pretraining setup with a v-objective. Moreover, we configure the noise scheduler of DDPM with 1000 steps to optimize the training process. [...] We employ the Adam optimizer with a learning rate of 3 x 10^-5. |
| Experiment Setup | Yes | The training regimen comprises 30,000 iterations, with a batch size of 16. To accommodate the training within a single GPU, we accumulate gradients over 16 steps. We employ the Adam optimizer with a learning rate of 3 x 10^-5. Typically, achieving convergence during our training process necessitates approximately 12 hours when executed on a single Nvidia RTX A800 GPU card. We set the ensemble size as 10, meaning we aggregate predictions from 10 inference runs for each image. |
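The intrinsic augmentation quoted above (randomly enlarging images up to twice their size, then cropping, so that the effective focal length and principal point vary) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `augment_intrinsics`, the nearest-neighbour resize, and the uniform sampling of the scale factor are all assumptions; the paper only states the 2x upper bound and cites Lee et al. 2021 for the approach.

```python
import numpy as np

def augment_intrinsics(image, K, rng=None):
    """Randomly enlarge `image` by a factor s in [1, 2], then crop back to
    the original size, updating the 3x3 intrinsic matrix K accordingly.
    Scaling multiplies the focal lengths; cropping shifts the principal
    point. (Hypothetical sketch; nearest-neighbour resize for brevity.)"""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    s = rng.uniform(1.0, 2.0)                      # enlarge factor, up to 2x
    new_h, new_w = int(round(h * s)), int(round(w * s))
    ys = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    big = image[ys][:, xs]                         # enlarged image
    y0 = int(rng.integers(0, new_h - h + 1))       # random crop origin
    x0 = int(rng.integers(0, new_w - w + 1))
    crop = big[y0:y0 + h, x0:x0 + w]
    K_aug = K.astype(float).copy()
    K_aug[0, 0] *= s                               # fx
    K_aug[1, 1] *= s                               # fy
    K_aug[0, 2] = K[0, 2] * s - x0                 # cx
    K_aug[1, 2] = K[1, 2] * s - y0                 # cy
    return crop, K_aug
```

A real pipeline would use bilinear interpolation for the resize, but the intrinsic bookkeeping (scale the focals, shift the principal point by the crop offset) is the substantive part of the augmentation.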
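The experiment setup reaches its effective batch size of 16 on a single GPU by accumulating gradients over 16 steps before each optimizer update. A minimal framework-free sketch of that loop shape, using a toy linear model and plain gradient descent in place of Adam (the helper `grad_accum_step` and the toy loss are illustrative assumptions, not the authors' training code):

```python
import numpy as np

def grad_accum_step(params, micro_batches, lr=3e-5):
    """One optimizer update after accumulating gradients over all
    micro-batches, mimicking an effective batch of sum of micro sizes.
    Toy model: y = w*x + b with mean-squared error."""
    accum = np.zeros_like(params)
    for x, y in micro_batches:
        pred = params[0] * x + params[1]
        err = pred - y
        # gradient of the MSE loss w.r.t. (w, b) on this micro-batch
        g = np.array([np.mean(2 * err * x), np.mean(2 * err)])
        accum += g / len(micro_batches)   # average, as frameworks do
    return params - lr * accum            # single update per accumulation
```

In a real PyTorch loop the same shape appears as 16 backward passes (which sum into `.grad`) followed by one `optimizer.step()` and `optimizer.zero_grad()`.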
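The ensemble size of 10 means each final prediction aggregates 10 stochastic diffusion inference runs. The aggregation below is a sketch under assumptions: the paper does not specify the reducer, so the per-pixel median here (robust to outlier runs; a mean would be equally plausible) and the function name `ensemble_predict` are hypothetical.

```python
import numpy as np

def ensemble_predict(infer_fn, image, n=10):
    """Run the stochastic inference function `n` times on the same image
    and aggregate the dense per-pixel predictions across runs."""
    preds = np.stack([infer_fn(image) for _ in range(n)])
    return np.median(preds, axis=0)   # per-pixel median over the n runs
```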