Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Authors: Ruiyu Wang, Yu Yuan, Shizhao Sun, Jiang Bian

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that CADFusion improves performance, both qualitatively and quantitatively. Code is available at https://github.com/microsoft/CADFusion.
Researcher Affiliation | Collaboration | Ruiyu Wang (University of Toronto), Yu Yuan (University of Science and Technology of China), Shizhao Sun and Jiang Bian (Microsoft Research Asia).
Pseudocode | No | The paper describes the methodology in narrative text and step-by-step descriptions within Sections 3.2 and 3.3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/microsoft/CADFusion.
Open Datasets | Yes | For the dataset used in the sequential learning stage, we use the DeepCAD dataset (Wu et al., 2021) as the source of CAD parametric sequences (specifically the version processed by Xu et al. (2022)). We construct a dataset comprising 20k pairs of textual instructions and CAD parametric sequences using the techniques introduced in Section 3.2 and Appendix B.3. For the preference data used in the visual feedback stage, we employ llava-onevision-qwen2-7b (Li et al., 2024a) to construct it using the method introduced in Section 3.3.
Dataset Splits | Yes | For the test set, we construct it by splitting the dataset used in sequential learning into train, validation, and test sets with a 90:5:5 ratio.
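A 90:5:5 split of this kind can be sketched in plain Python. The seed and the placeholder record list are illustrative assumptions, not details from the paper:

```python
import random

def split_dataset(records, ratios=(0.90, 0.05, 0.05), seed=0):
    """Shuffle and partition records into train/val/test by the given ratios."""
    records = list(records)
    random.Random(seed).shuffle(records)  # seeded for a reproducible split
    n = len(records)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

# 20k text-sequence pairs, matching the size of the sequential-learning dataset
pairs = [f"pair_{i}" for i in range(20000)]
train, val, test = split_dataset(pairs)
print(len(train), len(val), len(test))  # 18000 1000 1000
```

With 20k pairs this yields 18,000 training, 1,000 validation, and 1,000 test examples.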
Hardware Specification | Yes | Training is conducted on four NVIDIA A6000-48GB GPUs using PyTorch Distributed Data Parallel (DDP).
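A four-GPU single-node DDP run of this kind is typically launched with torchrun; the script name `train.py` is an illustrative assumption, not taken from the paper's repository:

```shell
# Launch one process per GPU on a single 4-GPU node; torchrun sets the
# RANK / WORLD_SIZE / LOCAL_RANK environment variables that DDP reads at init.
torchrun --standalone --nproc_per_node=4 train.py
```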
Software Dependencies | No | LLaMA-3-8b-Instruct is used as the LLM backbone, with a maximum token length of 1024. For efficient fine-tuning, we adopt Low-Rank Adaptation (LoRA) (Hu et al., 2022) with hyperparameters r = 32 and α = 32. [...] Training is conducted on four NVIDIA A6000-48GB GPUs using PyTorch Distributed Data Parallel (DDP). The paper mentions specific models (LLaMA-3-8b-Instruct, llava-onevision-qwen2-7b) and frameworks (PyTorch DDP), but does not provide explicit version numbers for software libraries or frameworks like 'PyTorch 1.x'.
Experiment Setup | Yes | LLaMA-3-8b-Instruct is used as the LLM backbone, with a maximum token length of 1024. For efficient fine-tuning, we adopt Low-Rank Adaptation (LoRA) (Hu et al., 2022) with hyperparameters r = 32 and α = 32. The initial sequential learning stage lasts for 40 epochs with a learning rate of 1 × 10⁻⁴, using the AdamW optimizer. Following this, we run 5 iterations of alternating visual feedback and sequential learning stages. The visual feedback stage lasts for 5 epochs on the preference data, while the sequential learning stage lasts for 1 epoch using the same dataset as the initial sequential learning stage.
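With r = 32 and α = 32, the LoRA scaling factor α/r equals 1, so the low-rank update is added to the frozen weight at full strength. A minimal sketch of the LoRA update ΔW = (α/r)·B·A on toy nested-list matrices (the matrix values here are illustrative; the paper applies rank-32 adapters to the LLaMA backbone):

```python
def lora_delta(B, A, alpha, r):
    """Compute the LoRA weight update (alpha / r) * B @ A on nested lists."""
    scale = alpha / r
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

# Toy rank-1 adapter: B is (2 x 1), A is (1 x 2); alpha = r = 32 gives scale 1.0
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
delta = lora_delta(B, A, alpha=32, r=32)
print(delta)  # [[3.0, 4.0], [6.0, 8.0]]
```

During training only B and A (2r·d parameters per adapted d×d matrix) receive gradients, which is what makes fine-tuning an 8B-parameter backbone feasible on four 48 GB GPUs.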