Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering

Authors: Yibo Zhang, Lihong Wang, Changqing Zou, Tieru Wu, Rui Ma

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have yielded promising results and demonstrated the potential of our framework. Project page is at https://yiboz2001.github.io/Diff3DS/. The evidence comes from Section 6 (Experiments), which spans 6.1 Implementation Details, 6.2 Text-to-3D Sketch, 6.3 Image-to-3D Sketch, and 6.4 Ablations and Analysis.

Section 6.2 (Text-to-3D Sketch). Baselines: to the best of our knowledge, we are the first text-to-3D sketch method. We select the existing text-to-3D object methods DreamGaussian (Tang et al., 2024) and MVDream (Shi et al., 2024) as comparison baselines, and further choose the text-to-2D sketch method DiffSketcher (Xing et al., 2023) as an additional perceptual reference. Qualitative comparisons: Fig. 5 shows the qualitative results of the text-to-3D sketch task. Quantitative comparisons: we collect 35 text prompts from previous works (Poole et al., 2023; Shi et al., 2024) and websites. Notably, accurate evaluation of 3D sketch generation is yet to be resolved due to the absence of ground-truth sketches. We measure the CLIP text-image similarity (Radford et al., 2021) (CLIP-Score T) and BLIP-Score (Li et al., 2022) metrics to evaluate the consistency of the rendered views with the input text prompt, following previous work (Xing et al., 2024). For all metrics, we render the 3D sketch into 8 views, compute the metric between each view and the input text prompt, and use the averaged value as the final result. Table 1 shows the evaluation results; our method outperforms all the text-to-3D baselines.

Section 6.3 (Image-to-3D Sketch). Qualitative comparisons: Fig. 6 shows the qualitative results of the image-to-3D sketch task. Quantitative comparisons: we collect 25 images generated with Imagine (Ima, 2023). Following previous works (Qian et al., 2024; Choi et al., 2024), we use the CLIP visual similarity (Radford et al., 2021) (CLIP-Score I) metric to measure abstract semantic similarity between the reference image and the rendered novel views, while employing the LPIPS (Zhang et al., 2018) metric on the reference view to measure structural semantic similarity. For CLIP-Score I, we calculate the metric between each of the 8 rendered images and the reference image, and use the average value as the final score. Table 2 reports the evaluation results. On the LPIPS metric, our method significantly outperforms NEF and achieves a score comparable to 3Doodle, which optimizes for the reference view with additional losses, including the LPIPS and CLIP losses. On the CLIP-Score metric, our method surpasses all baseline approaches. User study: we further conduct a user study to evaluate the overall quality of our image-to-3D sketch results. Specifically, we prepared 25 tasks, each composed of three randomly ordered 3D sketches generated by NEF, 3Doodle, and Diff3DS. Section 6.4 covers ablations and analysis.
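The multi-view evaluation protocol quoted above (render the 3D sketch into 8 views, score each view against the text prompt or reference image, and average) can be sketched as follows. This is a hedged illustration, not the paper's code: `score_fn` is a hypothetical stand-in for a real renderer plus a CLIP or BLIP scorer, and evenly spaced azimuths are an assumption.

```python
def evaluation_azimuths(num_views=8):
    """Evenly spaced azimuth angles (degrees) around the object
    (even spacing is an assumption; the paper only says '8 views')."""
    return [i * 360.0 / num_views for i in range(num_views)]

def averaged_view_score(score_fn, reference, num_views=8):
    """Mean similarity between `reference` (a text prompt or an image)
    and each rendered view. `score_fn(reference, azimuth)` is assumed to
    render the sketch at that azimuth and return a similarity score."""
    scores = [score_fn(reference, az) for az in evaluation_azimuths(num_views)]
    return sum(scores) / len(scores)
```

In the paper's setting, `score_fn` would wrap the differentiable curve renderer plus CLIP (for CLIP-Score T/I) or BLIP; here it is deliberately left abstract so the averaging protocol itself is the only thing shown.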
Researcher Affiliation | Academia | Yibo Zhang (1), Lihong Wang (1), Changqing Zou (2,3), Tieru Wu (1,4), Rui Ma (1,4). Affiliations: 1 Jilin University; 2 State Key Lab of CAD&CG, Zhejiang University; 3 Zhejiang Lab; 4 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China.
Pseudocode | No | The paper describes methods and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 4.1 'Overview' describes the three stages of Diff3DS in paragraph form.
Open Source Code | No | Project page is at https://yiboz2001.github.io/Diff3DS/. The paper provides a link to a project page, which typically serves as a high-level overview or demonstration page. It does not contain an explicit statement of code release or a direct link to a source-code repository (e.g., GitHub, GitLab) for the methodology described in the paper.
Open Datasets | Yes | For the training dataset, we use the "toyhorse" and "toycar" from the 3Doodle-provided dataset, along with "hotdog", "ship", and "lego" from the NeRF Synthetic dataset (Mildenhall et al., 2021).
Dataset Splits | No | The paper mentions collecting '35 text prompts' and '25 images generated with Imagine', and also refers to the '3Doodle-provided dataset' and the 'NeRF Synthetic dataset'. However, it does not specify any training, validation, or test splits for any of these datasets used in its experiments, nor does it provide a methodology for creating such splits.
Hardware Specification | Yes | The training process requires 1 hour for the text-to-3D sketch task and 2 hours for the image-to-3D sketch task on a single NVIDIA A10 GPU.
Software Dependencies | No | We implement our rendering framework in C++/CUDA with a PyTorch interface (Paszke et al., 2019). The paper mentions C++/CUDA and PyTorch, but it does not provide specific version numbers for these or any other software libraries, which is necessary for reproducibility.
Experiment Setup | Yes | In the experiment, a user-specified number of curves is randomly initialized within a sphere of radius 1.5; the default curve number is set to 56. We randomly sample the camera position using a radius from 1.8 to 2.0, an azimuth in the range of -180 to 180 degrees, an elevation in the range of 0 to 30 degrees, and a field of view (fov) of 60 degrees. For the pre-trained models, we apply Stable Diffusion 2.1 (sta) and MVDream (Shi et al., 2024) for the text-to-3D sketch task, and Zero-1-to-3 (Liu et al., 2023) and Stable-Zero123 (Sta, 2023) for the image-to-3D sketch task. The training process requires 1 hour for the text-to-3D sketch task and 2 hours for the image-to-3D sketch task on a single NVIDIA A10 GPU. More details can be found in Appendix A. From Appendix A (Implementation Details): for all tasks, the total number of training steps is 4000. Starting from step 2000, we dynamically delete the noise every 100 steps. To optimize the control point positions, we use the Adam optimizer with a learning rate of 0.002. For the time annealing schedule, we decrease the maximum and minimum time steps from 0.85 to 0.3 and 0.1, respectively, over the first 3600 steps. The threshold is empirically set to 0.1, which is below 10% of the average curve length.
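The reported setup can be made concrete with a short sketch. The numeric ranges (camera radius 1.8-2.0, azimuth -180 to 180, elevation 0 to 30, fov 60) and the annealing endpoints (0.85 down to 0.3 and 0.1 over the first 3600 of 4000 steps) come from the row above; everything else is an assumption, in particular uniform camera sampling, the shared 0.85 starting value for both timestep bounds, and the linear interpolation curve.

```python
import random

def sample_camera(rng=random):
    """Sample a camera pose within the ranges quoted above.
    Uniform sampling is an assumption; the paper only gives the ranges."""
    return {
        "radius": rng.uniform(1.8, 2.0),      # distance to origin
        "azimuth": rng.uniform(-180.0, 180.0),  # degrees
        "elevation": rng.uniform(0.0, 30.0),    # degrees
        "fov": 60.0,                            # fixed field of view
    }

def annealed_timesteps(step, anneal_steps=3600,
                       t_start=0.85, t_max_end=0.3, t_min_end=0.1):
    """Decrease the maximum/minimum diffusion timesteps from 0.85 to
    0.3 and 0.1 over the first `anneal_steps` steps, then hold them.
    Linear interpolation is an assumption, not stated in the paper."""
    frac = min(step / anneal_steps, 1.0)
    t_max = t_start + frac * (t_max_end - t_start)
    t_min = t_start + frac * (t_min_end - t_start)
    return t_min, t_max
```

For example, `annealed_timesteps(0)` gives the widest range (both bounds at 0.85), and from step 3600 onward the bounds stay fixed at roughly (0.1, 0.3) for the remaining 400 training steps.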