DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation
Authors: Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsic and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image. |
| Researcher Affiliation | Academia | ¹College of Computer Science and Technology, Zhejiang University of Technology; ²Zhejiang Key Laboratory of Visual Information Intelligent Processing; ³State Key Lab of CAD & CG, Zhejiang University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methodology in text and illustrates the pipeline with figures. |
| Open Source Code | No | The paper does not include an unambiguous statement where the authors state they are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Training Datasets We choose Hypersim (Roberts et al. 2021) as our primary training dataset for incident map and depth map generation. This dataset comprises 461 synthetic indoor scenes with depth information and consistent ground-truth camera intrinsic parameters of [889, 889, 512, 384] across all scenes. We use 365 scenes for training, following the recommended setup. To increase the variety of training scenarios, we incorporate additional datasets: NuScenes (Caesar et al. 2020), KITTI (Geiger et al. 2013), Cityscapes (Cordts et al. 2016), NYUv2 (Silberman et al. 2012), SUN3D (Xiao, Owens, and Torralba 2013), ARKitScenes (Baruch et al. 2021), Objectron (Ahmadyan et al. 2021), and MVImgNet (Yu et al. 2023). These datasets are used with their camera intrinsic parameters but without depth information. For consistency, we replace the depth input with a copied image input to match the network input. To introduce variations in intrinsic parameters, we augment the intrinsic settings by randomly enlarging images up to twice their size and then cropping them to a suitable size, following the approach in (Lee et al. 2021). This augmentation addresses the scarcity of intrinsic variations within the dataset, ensuring a robust training process. Testing Datasets For monocular camera calibration, our evaluation encompasses datasets such as Waymo (Sun et al. 2020), RGBD (Sturm et al. 2012), ScanNet (Dai et al. 2017), MVS (Fuhrmann, Langguth, and Goesele 2014), and Scenes11 (Chang et al. 2015). We ensure alignment with the benchmark provided by WildCamera (Zhu et al. 2023) for this task. |
| Dataset Splits | Yes | Training Datasets We choose Hypersim (Roberts et al. 2021) as our primary training dataset for incident map and depth map generation. This dataset comprises 461 synthetic indoor scenes with depth information and consistent ground-truth camera intrinsic parameters of [889, 889, 512, 384] across all scenes. We use 365 scenes for training, following the recommended setup. [...] For monocular camera calibration, our evaluation encompasses datasets such as Waymo (Sun et al. 2020), RGBD (Sturm et al. 2012), ScanNet (Dai et al. 2017), MVS (Fuhrmann, Langguth, and Goesele 2014), and Scenes11 (Chang et al. 2015). We ensure alignment with the benchmark provided by WildCamera (Zhu et al. 2023) for this task. [...] In Table 1, for a fair comparison, we utilize the same data as WildCamera (Zhu et al. 2023) to train our method specifically for the incident map and evaluate the metrics on the test split of the seen dataset. |
| Hardware Specification | Yes | Typically, achieving convergence during our training process necessitates approximately 12 hours when executed on a single Nvidia RTX A800 GPU card. |
| Software Dependencies | Yes | We leverage the pre-trained model provided by Stable Diffusion v2.1 (Rombach et al. 2022), wherein we freeze the VAE encoder and decoder, focusing solely on training the U-Net. This training regimen adheres to the original pretraining setup with a v-objective. Moreover, we configure the noise scheduler of DDPM with 1000 steps to optimize the training process. [...] We employ the Adam optimizer with a learning rate of 3 x 10^-5. |
| Experiment Setup | Yes | The training regimen comprises 30,000 iterations, with a batch size of 16. To accommodate the training within a single GPU, we accumulate gradients over 16 steps. We employ the Adam optimizer with a learning rate of 3 x 10^-5. Typically, achieving convergence during our training process necessitates approximately 12 hours when executed on a single Nvidia RTX A800 GPU card. We set the ensemble size as 10, meaning we aggregate predictions from 10 inference runs for each image. |
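The intrinsic augmentation quoted above (randomly enlarging images up to twice their size, then cropping, so that the effective focal length and principal point vary) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `augment_intrinsics`, the nearest-neighbour resize, and the uniform sampling of the scale factor are all assumptions; the paper only states the 2x upper bound and cites Lee et al. 2021 for the approach.

```python
import numpy as np

def augment_intrinsics(image, K, rng=None):
    """Randomly enlarge `image` by a factor s in [1, 2], then crop back to
    the original size, updating the 3x3 intrinsic matrix K accordingly.
    Scaling multiplies the focal lengths; cropping shifts the principal
    point. (Hypothetical sketch; nearest-neighbour resize for brevity.)"""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    s = rng.uniform(1.0, 2.0)                      # enlarge factor, up to 2x
    new_h, new_w = int(round(h * s)), int(round(w * s))
    ys = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    big = image[ys][:, xs]                         # enlarged image
    y0 = int(rng.integers(0, new_h - h + 1))       # random crop origin
    x0 = int(rng.integers(0, new_w - w + 1))
    crop = big[y0:y0 + h, x0:x0 + w]
    K_aug = K.astype(float).copy()
    K_aug[0, 0] *= s                               # fx
    K_aug[1, 1] *= s                               # fy
    K_aug[0, 2] = K[0, 2] * s - x0                 # cx
    K_aug[1, 2] = K[1, 2] * s - y0                 # cy
    return crop, K_aug
```

A real pipeline would use bilinear interpolation for the resize, but the intrinsic bookkeeping (scale the focals, shift the principal point by the crop offset) is the substantive part of the augmentation.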
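The experiment setup reaches its effective batch size of 16 on a single GPU by accumulating gradients over 16 steps before each optimizer update. A minimal framework-free sketch of that loop shape, using a toy linear model and plain gradient descent in place of Adam (the helper `grad_accum_step` and the toy loss are illustrative assumptions, not the authors' training code):

```python
import numpy as np

def grad_accum_step(params, micro_batches, lr=3e-5):
    """One optimizer update after accumulating gradients over all
    micro-batches, mimicking an effective batch of sum of micro sizes.
    Toy model: y = w*x + b with mean-squared error."""
    accum = np.zeros_like(params)
    for x, y in micro_batches:
        pred = params[0] * x + params[1]
        err = pred - y
        # gradient of the MSE loss w.r.t. (w, b) on this micro-batch
        g = np.array([np.mean(2 * err * x), np.mean(2 * err)])
        accum += g / len(micro_batches)   # average, as frameworks do
    return params - lr * accum            # single update per accumulation
```

In a real PyTorch loop the same shape appears as 16 backward passes (which sum into `.grad`) followed by one `optimizer.step()` and `optimizer.zero_grad()`.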
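The ensemble size of 10 means each final prediction aggregates 10 stochastic diffusion inference runs. The aggregation below is a sketch under assumptions: the paper does not specify the reducer, so the per-pixel median here (robust to outlier runs; a mean would be equally plausible) and the function name `ensemble_predict` are hypothetical.

```python
import numpy as np

def ensemble_predict(infer_fn, image, n=10):
    """Run the stochastic inference function `n` times on the same image
    and aggregate the dense per-pixel predictions across runs."""
    preds = np.stack([infer_fn(image) for _ in range(n)])
    return np.median(preds, axis=0)   # per-pixel median over the n runs
```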